Systems on silicon show a continuous increase in complexity due to the ever
increasing need for implementing new features and improvements of existing
functions. This is enabled by the increasing density with which components can be
integrated on an integrated circuit. At the same time the clock speed at which circuits
are operated tends to increase too. The higher clock speed in combination with the
increased density of components has reduced the area which can operate
synchronously within the same clock domain. This has created the need for a modular
approach. According to such an approach the processing system comprises a plurality
of relatively independent, complex modules. In conventional processing systems the
modules usually communicate with each other via a bus. As the number of
modules increases, however, this way of communication is no longer practical for the
following reasons. On the one hand the large number of modules places too high a
load on the bus. On the other hand the bus forms a communication bottleneck, as it
enables only one device at a time to send data to the bus. A communication network
forms an effective way to overcome these disadvantages. The communication network
comprises a plurality of partly connected nodes. Messages from a module are
redirected by the nodes to one or more other nodes. To that end the message comprises
first information indicative of the location of the addressed module(s) within the
network. The message may further include second information indicative of a
particular location within the module, such as a memory or a register address. The
second information may invoke a particular response of the addressed module.

It is an object of the invention to provide an integrated circuit and a method 
according to the introductory paragraph, which provides the modules therein a 
relatively simple way of issuing messages. 

In order to achieve said object the integrated circuit is characterized by the 
characterizing portion of claim 1. 

In the integrated circuit according to the invention modules can issue messages in a
simple way, by using a single address. This makes it possible for a module to perform
a write action to a particular memory address without being aware of the destination
in which said address is located.

In this way the network appears to the module issuing the message as a bus. This
makes it relatively simple to incorporate already existing modules designed for a
bus-like architecture in an integrated circuit according to the invention.

As such, processing systems are known where a processor is coupled via a bus to
various memories, which are each mapped onto a respective portion of the total
address range. By way of example a ROM and a RAM may be mapped to a first and a
second address range respectively. When the processor performs a read instruction,
the address in the instruction defines at the same time which memory is selected to
read the data from.

In such known processing systems each of the various modules, such as memories, is
directly coupled to the bus. Selecting one of the modules implies that the other
memories are set in a state wherein they do not interfere with the bus traffic. In
the integrated circuit according to the invention, 1) apart from the memory that is
addressed no other module is required to perform an action (the other modules do not
have to be set in a particular state and do not need to know that another module is
active), and 2) multiple concurrent and/or pipelined messages can be active
simultaneously in the network as a whole. In an integrated circuit according to the
invention, information issued by the active module is transferred as a
message via one or more nodes of the network. As a consequence it follows a
different route through the network depending on the address. This route is scheduled
by the network.

Examples of the two pieces of information that are arranged as a single address are:

1) Single logical memory space/map/range mapped to multiple distributed memories,
each with their own physical memory ranges.

2) Virtual memory space mapped to a single logical memory space (distributed or not).

3) Multiple memory spaces/maps/ranges mapped to multiple distributed memories.

For 2) and 3) two translations may take place (vm -> logical -> physical, and
multiple -> single -> physical).
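As a hedged illustration of the first example, the sketch below splits a single logical address into the two pieces of information discussed above: the destination module and the local address within that module. The memory map, the module names `rom0` and `ram0`, and the `decode` helper are assumptions made for illustration only; the text does not prescribe them.

```python
from typing import NamedTuple


class Region(NamedTuple):
    base: int    # first logical address of the region
    size: int    # number of addressable locations in the region
    module: str  # hypothetical name of the module holding the memory


# Hypothetical memory map: one logical space spread over two memories.
MEMORY_MAP = [
    Region(base=0x0000, size=0x4000, module="rom0"),
    Region(base=0x4000, size=0x8000, module="ram0"),
]


def decode(address: int) -> tuple[str, int]:
    """Return (destination module, local address) for a logical address."""
    for region in MEMORY_MAP:
        if region.base <= address < region.base + region.size:
            return region.module, address - region.base
    raise ValueError(f"address {address:#x} is not mapped")
```

The issuing module only supplies `address`; which module is reached, and hence which route the message takes, follows from the map.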

The integrated circuit of claim 3 and the method of claim 4 provide another way of 
improving data transfer in an integrated circuit comprising a plurality of modules 
connected by a network. 

Theoretically a transaction could comprise any number of outgoing and/or return 
messages. In practice however a transaction is made up of one or two outgoing 
messages (from the first to the second module), and zero, one, or two return messages 
(from the second to the first module). By managing the outgoing messages in a way
different from the return messages the overall efficiency of the network and therewith
the integrated circuit comprising the network is improved. This is further illustrated
with the following embodiments.

With reference to claim 5 it is remarked that GT (guaranteed throughput) connections
can overbook resources in some cases. For example, when an ANIP opens a GT read
connection, it must reserve slots for the read command messages, and for the read
data messages. The ratio between the two can be very large (e.g., 1:100), which leads
either to large slot tables, or to bandwidth being wasted for the read command
messages. In order to prevent as much as possible that a reservation for guaranteed
traffic impedes other transactions, the bandwidth which can be reserved should be
restricted. On the other hand the best-effort (BE) traffic may use any resources which
are currently available. As a consequence guaranteed traffic has a bounded but on
average higher latency than best-effort traffic, which has no fixed upper bound but is
(or should be) faster on average.

Based on this recognition it has been found that the overall quality of the network
transport can be improved by using BE packets for read command
messages, and GT packets for read data messages. No guarantees can be offered in
this case, but the overall throughput can be higher and more stable than in the case of
using only BE packets.

With reference to claim 6 it is remarked that preferably the outgoing transactions are
handled in a locally ordered and the return transactions in a globally ordered
transaction mode. The one or more addressed modules process the transactions in the
order they have been issued, and the return parts of the transactions are all delivered to
the first module in the order in which it initiated the transactions. Even if ordered
channels are used, the responses from different addressed modules (e.g., in a
narrowcast connection) must be sorted at the first module. This kind of ordering
conforms with AMBA.
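The sorting of responses at the first module can be sketched as a small reorder buffer. This is only an illustration under the assumption that each transaction carries a sequence number assigned when it is issued; the text does not state how the sorting is implemented, and the class and method names are hypothetical.

```python
class ReorderBuffer:
    """Releases responses in the order the transactions were issued."""

    def __init__(self) -> None:
        self._next_issue = 0    # sequence number for the next transaction
        self._next_release = 0  # sequence number the master expects next
        self._pending: dict[int, str] = {}

    def issue(self) -> int:
        """Assign a sequence number to a newly initiated transaction."""
        seq = self._next_issue
        self._next_issue += 1
        return seq

    def receive(self, seq: int, response: str) -> list[str]:
        """Store a response; return every response that can now be delivered."""
        self._pending[seq] = response
        released = []
        while self._next_release in self._pending:
            released.append(self._pending.pop(self._next_release))
            self._next_release += 1
        return released
```

A response that overtakes an earlier one is held back until the earlier response has arrived, so the master always sees the issue order.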

To implement global ordering, transactions that are delivered to different second
modules (also referred to as slaves) must be ordered exactly as they were sent by the
first module (also referred to as master). One way to achieve this is for the network
to have a global time indicator and to use, e.g., deadline-based scheduling in the
network, while in addition assumptions on the consumption time of the second modules
must be available. An alternative way to introduce global ordering is to introduce
explicit dependencies between transactions. The latter can be done by using
acknowledged/tagged transactions, where proof of delivery to the slave is sent back to
the master using an acknowledgement message. This solution, however, introduces
extra latency because transactions are sequentialised with a round-trip delay/latency
per transaction (send a message, wait for the acknowledgement, send the next message,
wait for the next acknowledgement, etc.). By requiring only a local ordering for the
delivery of the outgoing transactions, the slaves, provided that they are autonomous
(which is usually the case), can execute messages independently.

With reference to claim 7 it is remarked that in this way buffer space is used in
an efficient way. A particular example is an embodiment wherein a large buffer
space is reserved for the buffer of the network interface coupled to an active
module, such as a module issuing a read command, and a small buffer space is
reserved for the buffer of the network interface coupled to a passive module, e.g.
the one receiving the read message.

In other situations there may be different types of flow control (e.g. one never wants
to lose write commands, but does not mind losing read data). If a module can issue both
read and write commands, it may be important that write transactions always succeed
(e.g. when writing to an interrupt controller), but that read transactions are not critical
because they can be retried (so either the CMD of the read transaction is dropped and
the read is never executed, or the RETDATA is dropped after the read has been
executed). Another example is that if it is known that writes always succeed if they are
delivered, a flow-controlled connection is requested. Acknowledgements are not
necessary in that case. Without flow control acknowledgements are compulsory,
complicating the master and causing additional traffic.

In the integrated circuit according to the invention the decision whether or not to
drop messages is not taken per transaction but for the outgoing and return parts of a
connection as a whole. For example all outgoing messages (having the format
read+address or write+address+data) may be guaranteed lossless, while all return
messages (whether read data or write acknowledgements) may be dropped.



A connection could be opened as follows:

connid = open(
    nofc/fc,
    outgoing unordered/local/global,
    outgoing buffer size,
    return unordered/local/global,
    return buffer size);

i.e. all outgoing messages have certain properties, and all return messages have
certain properties.
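The call above can be made concrete with a small sketch in which each part of the connection carries its own properties. All names here (`PartProperties`, `open_connection`, the buffer size unit) are hypothetical; a flow control flag is given per part on the assumption, stated elsewhere in the text, that flow control on the outgoing and return parts is in principle independent.

```python
from dataclasses import dataclass
from enum import Enum


class Ordering(Enum):
    UNORDERED = "unordered"
    LOCAL = "local"
    GLOBAL = "global"


@dataclass(frozen=True)
class PartProperties:
    flow_control: bool  # fc (True) or nofc (False)
    ordering: Ordering
    buffer_size: int    # reserved buffer space, in messages (assumed unit)


@dataclass(frozen=True)
class Connection:
    connid: int
    outgoing: PartProperties
    ret: PartProperties


_next_connid = 0


def open_connection(outgoing: PartProperties, ret: PartProperties) -> Connection:
    """Hypothetical stand-in for the open() call in the text."""
    global _next_connid
    if outgoing.buffer_size <= 0 or ret.buffer_size <= 0:
        raise ValueError("buffer sizes must be positive")
    conn = Connection(_next_connid, outgoing, ret)
    _next_connid += 1
    return conn


# A read connection: small lossless command buffer, large buffer for read data.
read_conn = open_connection(
    outgoing=PartProperties(True, Ordering.LOCAL, buffer_size=2),
    ret=PartProperties(False, Ordering.GLOBAL, buffer_size=32),
)
```

The properties are fixed when the connection is opened, so every outgoing message and every return message on it is handled in the same way.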

With reference to claim 8 it is remarked that in a processing system, with modules 
working asynchronously with respect to each other it is usual that a module receiving 
data issues an acknowledge signal to inform the issuing processor that it has received 
a message. In case that a message is multicast a plurality of said acknowledge signals 
is generated, which imposes a burden for the issuing processor. In the integrated 
circuit of the invention the first module receives only a single message, which reduces 
this burden. This measure is based on the insight that the network usually can 
relatively easily generate the single return message in response to the plurality of 
acknowledge messages of the second modules as a side effect of the functions already 
present in the network for other purposes. 

With reference to claim 9: Depending on the situation the single return message can
depend on the acknowledge messages in various ways. The embodiment of claim 2 is
favorable where the addressed second modules are memories, and the first module
attempts to store data therein. In that case it is sufficient that only one copy of the data
is really received and stored.

With reference to claim 10: In other situations it is compulsory that each of the
addressed second modules has received the data. In the embodiment of claim 10 the
single return message is not generated until this is the case.

Otherwise the return message could be combined as follows.

If the write transaction has been successfully executed by all slaves, all will
return RETSTAT=RETOK, which can be combined by the ANIP into a single message to
be delivered to the master.

If the write transaction has been successfully executed only by some slaves, there
will be a mix of RETSTATs (RETOK and RETERROR). They can either be
combined into

(a) a single RETSTAT=RETERROR, to specify that an error occurred, or

(b) a single RETSTAT, but a larger, more descriptive one, encoding
where there have been errors. All RETSTATs can be bundled together
in a single RETSTAT for the master, or <slave identifier, error code>
pairs can be bundled to form a single RETSTAT for the master.

If the connection has no flow control, messages can be dropped at the PNIPs,
resulting also in RETSTAT=RETLOST messages. Again, combinations as those above
can be made.
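The two ways of combining the RETSTATs can be sketched as follows. This is a minimal illustration, not the combining logic of the network itself: the function names are hypothetical, statuses are modelled as strings, and treating RETLOST like an error in option (a) is an assumption.

```python
RETOK, RETERROR, RETLOST = "RETOK", "RETERROR", "RETLOST"


def combine_single(retstats: dict[str, str]) -> str:
    """Option (a): collapse the statuses of all slaves into one value.

    Assumption: any status other than RETOK (including RETLOST) is
    reported to the master as RETERROR.
    """
    if not retstats:
        raise ValueError("no RETSTATs to combine")
    if all(status == RETOK for status in retstats.values()):
        return RETOK
    return RETERROR


def combine_descriptive(retstats: dict[str, str]) -> list[tuple[str, str]]:
    """Option (b): bundle <slave identifier, error code> pairs."""
    return [
        (slave, status)
        for slave, status in sorted(retstats.items())
        if status != RETOK
    ]
```

Option (a) keeps the return message as small as a single status, whereas option (b) lets the master find out which slaves failed, at the cost of a larger message.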
With reference to claim 11: In this way it is guaranteed that the first module always
receives a response to a transaction, even if the connection has no flow control (i.e.
data may be dropped). This is done by only dropping data in the PNIP (the network
interface coupled to the second, receiving module), and returning a FAIL/ERROR to
the ANIP (the network interface coupled to the first module). This return status
(RETSTAT) message will never be dropped because the ANIP that initiated the
transaction will reserve space for return messages of every transaction that it initiates.
This combination of reserving space and generating an error message whenever a
message is dropped is a way to introduce flow control. Preferably the RETSTAT
message is generated by the interface of the receiving module, although alternatively
it could be generated at the intermediary network nodes too.

The method according to the invention guarantees transaction completion, i.e. it is
always known whether an initiated transaction

(a) was delivered and executed successfully at the slave (RETSTAT=OK produced by
the slave), or

(b) was never delivered at the slave (RETSTAT=REQLOST produced by the PNIP),
or

(c) was delivered at the slave, but not successfully executed (RETSTAT=ERROR
produced by the slave), or

(d) was delivered and executed successfully at the slave but the response message was
dropped (RETSTAT=RETLOST produced by the ANIP).

This is achieved by either

(i) not dropping messages (flow-controlled connection); in this case RETSTAT is
either OK or ERROR, or

(ii) by allowing messages to be dropped (on a connection without flow control), but
generating a RETSTAT (REQLOST or RETLOST) whenever the message is dropped,
or a RETOK or RETERROR as usual when the message is not dropped.

It is essential, however, never to drop RETSTATs, because the RETSTAT completes the
transaction. This is realized in that a buffer for the RETSTAT is located at the master's
ANIP. The latter reserves space for RETSTATs when initiating transactions, and
bounds the number of outstanding transactions (for finite sized RETSTAT buffers).
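How reserving RETSTAT space bounds the number of outstanding transactions can be sketched as below. The `Anip` class, its method names, and the slot counting are hypothetical and serve only to illustrate the reservation idea; they are not taken from the text.

```python
from typing import Optional


class Anip:
    """Hypothetical active network interface with a finite RETSTAT buffer."""

    def __init__(self, retstat_slots: int) -> None:
        self._free_slots = retstat_slots
        self._outstanding: set[int] = set()
        self._next_id = 0

    def initiate(self) -> Optional[int]:
        """Start a transaction only if a RETSTAT slot can be reserved."""
        if self._free_slots == 0:
            return None  # the master must wait; nothing is sent
        self._free_slots -= 1
        tid = self._next_id
        self._next_id += 1
        self._outstanding.add(tid)
        return tid

    def complete(self, tid: int, retstat: str) -> str:
        """Consume the RETSTAT of a transaction and free its reserved slot."""
        self._outstanding.remove(tid)
        self._free_slots += 1
        return retstat
```

Because a slot is reserved before anything is sent, an arriving RETSTAT always finds space and never has to be dropped.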

The flow control on the outgoing and return connections is in principle independent.
Thus, in case of outgoing flow control & return flow control, the RETSTAT message is
according to a) or c) above.

In case of outgoing flow control & no return flow control, the RETSTAT message is
a) or c) or d) above.

In case of no outgoing flow control & return flow control, the RETSTAT message is
a) or b) or c) above.
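The three cases above follow one rule, which the sketch below spells out. The function name is hypothetical, and the result for the fourth combination (no flow control in either direction), which the text does not list, is an extrapolation of the same rule.

```python
def possible_retstats(outgoing_fc: bool, return_fc: bool) -> set[str]:
    """Return which of the outcomes (a)-(d) can occur on a connection."""
    outcomes = {"a", "c"}  # delivered: executed successfully (a) or not (c)
    if not outgoing_fc:
        outcomes.add("b")  # the request may be dropped before the slave
    if not return_fc:
        outcomes.add("d")  # the response may be dropped at the ANIP
    return outcomes
```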

Other embodiments are such an integrated circuit wherein the return message is a
message indicating whether the second module has received a message from the first
module. In this embodiment the return message can be very compact, e.g. one or two
bits to indicate one of the four options described above.
Alternatively or in addition a return message comprises an identification of the
message received by the second module.

Page: 7 

1. I suggest "efficiency" instead of "performance" because performance is just one of the
factors. We may have the option to reduce the cost of the network (e.g., reduce buffer sizes), or
increase the performance (e.g., by adding more connections for the same resources).
Page: 3

2. This is an example for the use of different properties for outgoing and return parts. However,
more can be defined:
- Acknowledged write transaction: write command + outgoing data use guaranteed throughput
(mode one in your example), and acknowledgment uses best effort (mode two in your example).
Moreover, apart from time-related guarantees, there is also a distinction in the buffering in both
your example and the above example. For data messages there is potentially more buffering
allocated than for commands and acknowledgments. Consequently, for a read transaction (your
example) buffers for the return part would be larger than those for the outgoing part. For the
acknowledged write (the example above), buffers for the outgoing part are larger, and those for
acknowledgments are smaller.
Page: 8 

3. It is indeed possible to allocate different bandwidths as you suggest. However, there are also
limitations. We use a slot table, which contains a number of slots in a time window. Bandwidth is
reserved by allocating these slots to connections. For example, if we use a table with 100 slots for a
time frame of 1 µs, each slot will be allocated for 1/100 of 1 µs = 10 ns. If the network provides
1 Gb/s per link, the bandwidth per slot will be 1/100 of 1 Gb/s = 10 Mb/s. We can only allocate
multiples of 10 Mb/s for guaranteed throughput traffic.

For a read command generating long bursts, allocating the minimum bandwidth of 10 Mb/s would
probably be too much, as it will use only a small fraction of it. The bandwidth can indeed be used by
best-effort traffic, however, not by other guaranteed throughput traffic. As a result, not all the traffic
for which guarantees are needed may fit in the slot table.

An alternative is to use more slots, but this increases the cost of the router. This is why a best-effort
command may be a better solution.
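The slot arithmetic in comment 3 can be checked with a few lines of code. The helper names are hypothetical; the numbers (100 slots, a 1 µs frame, a 1 Gb/s link) are the ones used in the comment.

```python
import math


def slot_duration_ns(frame_ns: float, slots: int) -> float:
    """Length of one slot when the frame is divided evenly."""
    return frame_ns / slots


def slot_bandwidth_mbps(link_mbps: float, slots: int) -> float:
    """Bandwidth represented by a single slot of the table."""
    return link_mbps / slots


def slots_needed(required_mbps: float, link_mbps: float, slots: int) -> int:
    """Guaranteed bandwidth is granted in whole slots, so round up."""
    return math.ceil(required_mbps / slot_bandwidth_mbps(link_mbps, slots))
```

With a 1:100 ratio between read commands and read data, a stream of commands that needs only 0.1 Mb/s still occupies a whole 10 Mb/s slot, which is the waste the comment points at.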
Page: 8 

4. This definition is good for outgoing messages, as there is one source (ANIP) and potentially
multiple destinations (PNIPs). However, for return messages, we define global/local ordering as
follows. Global ordering means that responses from all PNIPs/slaves (i.e. sources of messages in this
case) come in the same order as the transactions have been initiated (i.e., the same order as the
commands have been issued by the master to the ANIP). Local ordering guarantees the order of
responses only if they come from the same slave/PNIP.

Page: 8

5. Slave modules

Page: 8 

6. We can only guarantee the order in which we offer transactions to the slave module, but the
order of processing depends on the module implementation. It can well decide to process
transactions in a different order (e.g., memory controller). For ordering we only require that the
responses are returned in the same order as the transactions were accepted.

Page: 8

7. This is only valid for global ordering. For local ordering (i.e., order preserved only per slave),
if ordered transport channels are used, no sorting is necessary.

Page: 8

8. Global ordering of responses conforms with AMBA. Local ordering of responses does not.
Page: 8 

9. I think Kees meant write transactions may be critical and we don't want to lose them, but
read transactions can be lost, because they can be tried later. See example below in the text.
Page: 8

10. The two commands (i.e., read and write) can indeed be sent from the same module. If we set
up a connection with flow control for the outgoing part, both commands will be delivered. However,
if the return part has no flow control, the responses for read commands may be lost. In such a case,
the read transactions will fail. I think Kees meant read transactions being lost, not read commands
being lost.

Page: 8 

11. fc = flow control, nofc = no flow control
Page: 8 

12. Buffer is reserved only for a return status message, such as an acknowledgment or an error
message. Buffer can be, but is not necessarily, reserved also for returned data.
Page: 8

13. Data can also be dropped at the ANIP (i.e., RETDATA) when no flow control is implemented
for the return part. In such a case, a RETSTAT=RETLOST will replace the RETSTAT=RETOK
which accompanied the dropped RETDATA.

Page: 8 

14. Has reserved 
Page: 8 

15. Yes, this is true. Between routers, there is always link-level flow control and no data is ever
lost. Data can be lost only in the network interfaces, if no end-to-end flow control (here referred
to simply as flow control) is implemented. Therefore, here, messages reach the PNIP even when no
(end-to-end) flow control is implemented.



These and other aspects are described in more detail in the following three annexes:

1. Communication Services for Networks on Chip, pages 1-25, by Andrei
Rădulescu and Kees Goossens.

Further background information useful for implementing the invention can be found
in:

2. Networks on Silicon: Blessing or Nightmare?, pp 1-5, by Paul Wielage and
Kees Goossens (published), and

3. Trade-Offs in the Design of a Router with Combined Guaranteed and Best-
Effort Services for Networks on Chip, pp 1-6, by Edwin Rijpkema, Kees Goossens,
Andrei Rădulescu, Jef van Meerbergen, and Paul Wielage, submitted to and rejected
by ISSS 2002.
Rădulescu and Goossens



I. Introduction



Networks on chip (NoC) have received considerable attention recently
as a solution to the interconnect problem in highly-complex chips [3-5, 7-9,
15, 19, 22]. The reason is twofold. First, NoCs help resolve the electrical
problems in new deep-submicron technologies, as they structure and
manage global wires [3-5, 7, 8]. At the same time they share wires, lowering
their number and increasing their utilization [7, 8]. NoCs can also be
energy efficient and reliable [4], and are scalable compared to buses [9].
Second, NoCs also decouple computation from communication, which is
essential in managing the design of billion-transistor chips [14, 22]. NoCs
achieve this decoupling because they are traditionally designed using
protocol stacks [21], which provide well-defined interfaces separating
communication service usage from service implementation [5, 22].

Using networks for on-chip communication when designing systems on
chip (SoC), however, raises a number of new issues that must be taken
into account. This is because, in contrast to existing on-chip interconnects
(e.g., buses, switches, or point-to-point wires), where the communicating
modules are directly connected, in a NoC the modules communicate
remotely via network nodes. As a result, interconnect arbitration changes
from centralized to distributed, and issues like out-of-order transactions,
higher latencies, and end-to-end flow control must be handled either by
the intellectual property block (IP) or by the network itself.

Most of these topics have been already the subject of research in the field 
of computer networks [24] and parallel machine interconnect networks [6]. 
However, on-chip networks have different properties (e.g., tighter link syn- 
chronization) and constraints (e.g., higher memory cost) leading to differ- 
ent design choices, which in the end affect the network services. 

In this paper, we compare NoCs and off-chip networks, showing both
their similarities and differences. We also explore the differences between
NoCs and existing on-chip interconnects. We list new issues that must be
resolved in system design due to the multi-hop nature of NoCs, and present
an interface which takes these issues into consideration. Our interface is
aimed at being similar to a split-transaction bus interface, such as VCI [25]
or OCP [17], to allow simple, low-cost wrappers to bus interfaces, and
to allow backward compatibility with existing IPs. Our interface uses a
request-response protocol that provides basic read and write operations.
But our interface extends bus interfaces to fully exploit the power of our
NoC [8, 19, 20]. For example, it offers connection-based communication
where end-to-end flow control and time-related guarantees (e.g., bounded
latency) can be requested.

The paper is organized as follows. In the next two sections we compare
the properties of NoCs with those of off-chip networks and buses, respectively.
In Section IV, we present the services we offer in our network. Finally, we
present our conclusions.



II. Networks Brought on Chip

Networks have been the subject of research for decades, both in the
context of local and wide area networks (computer networks) [24], and as
an interconnect for parallel machines. Both are very much related to
on-chip networks, and many of the results in those fields are also applicable
on chip. However, the premises of NoCs are different from those of off-chip
networks, and, therefore, most of the network design choices must be
reevaluated.

NoCs differ from off-chip networks mainly in their constraints and
synchronization. Typically, most on-chip resources have much tighter
constraints compared to off-chip. Storage (i.e., memory) and computation
resources are relatively more expensive, whereas the number of
point-to-point links is larger on chip than off chip [7].

Storage is expensive, because general-purpose on-chip memories, such as
RAMs, occupy a large area. Having the memory distributed in the network
components in relatively small sizes is even worse, as the overhead area
in the memory then becomes dominant.

Also computation for on-chip networks comes at a relatively high cost
compared to off-chip networks. An off-chip network interface usually
contains a dedicated processor to implement the protocol stack up to network
layer or even higher, to off-load the host processor from the communication
processing. Including a dedicated processor in a network interface is
not feasible on chip, as the size of the network interface will become
comparable to or larger than the IP to be connected to the network. Moreover,
running the protocol stack on the IP itself may also be not feasible,
because often these IPs have one dedicated function only, and do not have
the capabilities to run a network protocol stack.

The number of wires and pins to connect network components is an
order of magnitude larger on chip than off chip [7]. If they are not used
massively for other purposes than NoC, they allow wide point-to-point
interconnects (e.g., 300-bit links) [7, 15]. This is not possible off-chip, where
links are relatively narrow: 8-16 bits.

On-chip wires are also relatively short, allowing a much tighter
synchronization than off chip. This allows a reduction in the buffer space in
the routers because the communication can be done at a smaller
granularity. In the current semiconductor technologies, wires are also fast and
reliable, which allows simpler link-layer protocols (e.g., no need for
error correction, or retransmission). This also compensates for the lack of
memory and computational resources.

In the rest of the section, we list five network issues that have a direct
impact on the NoC cost: reliable communication, deadlock, data ordering,
network flow control and buffering strategy, and time-related guarantees.
For each of them, we discuss the differences and similarities for on- and
off-chip networks.

Reliable communication. A consequence of the tight on-chip
resource constraints is that the network components (i.e., routers and
network interfaces) must be fairly simple to minimize computation and
memory requirements. Luckily, on-chip wires provide a reliable communication
medium, which avoids the considerable overhead incurred by the off-chip
networks for providing reliable communication. Data integrity can be
provided at low cost at the data link layer. However, data loss also depends
on the network architecture, as in most computer networks data is
simply dropped if congestion occurs in the network [6, 24]. On-chip, dropping
data may lead to a too costly implementation of reliable communication.
We show below that a network where no data is dropped can lead to a much
lower-cost solution, at the peril of introducing the possibility of deadlock.

Deadlock. Computer network topologies have generally an irregular
(possibly dynamic) structure and bidirectional links, which can introduce
buffer cycles. In such topologies, packet dropping at the network nodes
may be required to avoid deadlocks.

Deadlock can also be avoided without dropping data, for example, by
introducing constraints either in the topology or routing. Fat-tree
topologies have already been considered for NoCs, where deadlock is avoided by
bouncing back packets in the network in case of overflow [9]. Tile-based
approaches to system design [7, 15, 23] use mesh or torus network
topologies, where deadlock can be avoided using, for example, a turn-model
routing algorithm [6].
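As a hedged illustration of a turn-model algorithm on a 2D mesh, the sketch below implements west-first routing, in which a packet makes all of its westward hops first and never turns towards the west afterwards. The choice of west-first, the coordinate convention, and the function name are assumptions; the text only mentions turn-model routing in general.

```python
def west_first_options(cur: tuple[int, int], dst: tuple[int, int]) -> list[str]:
    """Permitted minimal output directions under west-first routing.

    Coordinates are (x, y); x grows to the east and y grows to the north.
    """
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx < 0:
        return ["W"]  # all westward hops are taken first
    options = []
    if dx > 0:
        options.append("E")
    if dy > 0:
        options.append("N")
    elif dy < 0:
        options.append("S")
    return options  # an empty list means the packet has arrived
```

Because no packet ever turns from north or south towards the west, the cyclic buffer dependencies that cause deadlock cannot form, while packets heading east may still choose between several directions.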

An alternative solution for deadlock in NoCs, which takes into consider- 
ation that modules connecting to the network are either masters (initiating 
requests and receiving responses), or slaves (receiving requests and send- 
ing back responses), is to maintain separate virtual networks (with separate 



Data ordering. In a network, data sent from a source to a destina- 
tion may arrive out of order due to reordering in network nodes, following 
different routes, or retransmission after dropping. For off-chip networks 
out-of-order data delivery is typical. However for NoCs where no data is 
dropped, data can be forced to follow the same path between a source and 
a destination (deterministic routing) with no reordering. This in-order data 
transportation requires less buffer space, and reordering modules are no 
longer necessary. 

Network flow control and buffering strategy. Network flow control
and buffering strategy have a direct impact on the memory utilization
in the network. Wormhole routing requires only a flit buffer in the
router, whereas store-and-forward and virtual-cut-through routing require
at least the buffer space to accommodate a packet. Consequently, on chip,
wormhole routing may be preferred over virtual-cut-through or
store-and-forward routing. Similarly, input queuing may be a lower memory-cost
alternative to virtual-output-queuing or output-queuing buffering strategies,
because it has fewer queues. Dedicated FIFO memory structures at a low cost
also enable on-chip usage of virtual-cut-through routing or virtual output
queuing for a better performance [19]. However, using virtual-cut-through
routing and virtual output queuing at the same time is still too costly [19].

Time-related guarantees. Off-chip networks typically use packet
switching and offer best-effort services. Contention can occur at each
network node, making latency guarantees very hard to offer. Throughput
guarantees can still be offered using schemes such as rate-based switching [26]
or deadline-based packet switching [18], but with high buffering costs.

An alternative to provide such time-related guarantees is to use
time-division multiple access (TDMA) circuits, where every circuit is dedicated
to a network connection. Circuits provide guarantees at a relatively low
memory and computation cost. Network resource utilization is increased
when the network architecture allows any left-over guaranteed bandwidth
to be used by best-effort communication [10, 19, 20].
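The idea of TDMA circuits whose left-over bandwidth is used by best-effort traffic can be sketched with a per-link slot table. The class, its methods, and the string labels are hypothetical and only illustrate the principle; they do not describe the router of [19].

```python
from typing import Optional


class SlotTable:
    """Hypothetical per-link TDMA table: slot index -> GT connection or None."""

    def __init__(self, size: int) -> None:
        self._owner: list[Optional[int]] = [None] * size

    def reserve(self, connid: int, count: int) -> bool:
        """Reserve `count` free slots for a guaranteed connection, or fail."""
        free = [i for i, owner in enumerate(self._owner) if owner is None]
        if len(free) < count:
            return False
        for i in free[:count]:
            self._owner[i] = connid
        return True

    def schedule(self, slot: int, gt_has_data: bool) -> str:
        """Decide which class of traffic may use the link in a time slot."""
        owner = self._owner[slot % len(self._owner)]
        if owner is not None and gt_has_data:
            return f"GT:{owner}"
        return "BE"  # unreserved or unused slots go to best-effort traffic
```

A reserved slot that its connection does not use falls back to best-effort traffic, but it cannot be given to another guaranteed connection, which is why the table size limits the guarantees that can be offered.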
III. From buses to NoCs

Introducing networks (Figure 1) as on-chip interconnects radically
changes the communication when compared to direct interconnects, such
as buses or switches (Figure 2). This is because of the multi-hop nature
of a network, where communication modules are not directly connected,
but separated by one or more network nodes. This is in contrast with the
prevalent existing interconnects (i.e., buses) where modules are directly
connected. The implications of this change reside in the arbitration (which
must change from centralized to distributed), and in the communication
properties (e.g., ordering, or flow control).

In this section, we list some of these topics, and outline the differences
between NoCs and buses. We refer mainly to buses as direct interconnects,
because currently they are the most used on-chip interconnect. Most
of the bus characteristics also hold for other direct interconnects (e.g.,
switches [16]). Multilevel buses are a hybrid between buses and NoCs.
Depending on the functionality of the bridges, for our purposes, multilevel
buses either behave like simple buses [2] or like NoCs.



Programming Model. The programming model of a bus typically consists of
load and store operations which are implemented as a sequence of
primitive bus transactions. Bus interfaces typically have dedicated
groups of wires for command, address, write data, and read data
[1, 12, 13, 17, 25].

A bus is a resource shared by multiple IPs. Therefore, before using it, 
IPs must go through an arbitration phase, where they request access to the 
bus, and block until the bus is granted to them. 

A bus transaction involves a request and possibly a response. Modules
issuing requests are called masters, and those serving requests are
called slaves. If there is a single arbitration for a pair of
request-response, the bus is called non-split. In this case, the bus
remains allocated to the master of the transaction until the response is
delivered, even when this takes a long time. Alternatively, in a split
bus, the bus is released after the request to allow transactions from
different masters to be initiated. However, a new arbitration must be
performed for the response such that the slave can access the bus [11].

For both split and non-split buses, both communication parties have
direct and immediate access to the status of the transaction. In
contrast, network transactions are one-way transfers from an output
buffer at the source to an input buffer at the destination that cause
some action at the destination, the occurrence of which is not visible
at the source [6]. The effects of a network transaction are observable
only through additional transactions. A request-response type of
operation is still possible, but requires at least two distinct network
transactions. Thus, a bus-like transaction in a NoC will essentially be
a split transaction.

Transaction Ordering. Traditionally, on a bus all transactions are
ordered (cf. Peripheral VCI [25], AMBA [1], or CoreConnect PLB and
OPB [12, 13]). This is possible at a low cost, because the interconnect,
being a direct link between the communicating parties, does not reorder
data. However, on a split bus, a total ordering of transactions on a
single master may still cause performance penalties, when slaves respond
at different speeds. To solve this problem, recent extensions to bus
protocols allow transactions to be performed on connections. Ordering of
transactions within a connection is still preserved, but between
connections there are no ordering constraints (e.g., OCP [17], or Basic
VCI [25]). A few of the bus protocols allow out-of-order responses per
connection in their advanced modes (e.g., Advanced VCI [25]), but both
requests and responses arrive at the destination in the same order as
they were sent.

In a NoC, ordering becomes weaker. Global ordering can only be provided
at a very high cost due to the conflict between the distributed nature
of the networks, and the requirement of a centralized arbitration
necessary for global ordering.

Even local ordering, between a source-destination pair, may be costly.
Data may arrive out of order if it is transported over multiple routes.
In such cases, to still achieve an in-order delivery, data must be
labeled with sequence numbers and reordered at the destination before
being delivered.
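
The reordering step at the destination can be sketched as a small
reorder buffer. This is only an illustration under assumptions that are
not in the text: the class name ReorderBuffer is hypothetical, sequence
numbers start at 0 and never wrap, and buffer space is unbounded.

```python
class ReorderBuffer:
    """Hypothetical destination-side buffer that restores sending order."""

    def __init__(self):
        self.next_seq = 0   # sequence number that must be delivered next
        self.pending = {}   # early arrivals, keyed by sequence number

    def receive(self, seq, data):
        """Accept one arrival and return the items now deliverable in order."""
        self.pending[seq] = data
        delivered = []
        # Drain every item that has become contiguous with what was delivered.
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```

An item that arrives early is held back until all of its predecessors
have arrived, which is where the buffering cost of local ordering comes
from.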

Atomic Chains of Transactions. An atomic chain of transactions is a
sequence of transactions initiated by a single master that is executed
on a single slave exclusively. That is, other masters are denied access
to that slave, once the first transaction in the chain claimed it. This
mechanism is widely used to implement synchronization mechanisms between
master modules (e.g., semaphores).

On a bus, atomic operations can easily be implemented, as the central
arbiter will either (a) lock the bus for exclusive use by the master
requesting the atomic chain, or (b) know not to grant access to a locked
slave. In the former case, the time resources are locked is shorter
because once a master has been granted access to a bus, it can quickly
perform all the transactions in the chain (no arbitration delay is
required for the subsequent transactions in the chain). Consequently,
the locked slave and the bus can be opened up again in a short time.
This approach is used in AMBA and CoreConnect. In the latter case, the
bus is not locked, and can still be used by other modules, however, at
the price of a longer locking time of the slave. This approach is used
in VCI and OCP.

In a NoC, where the arbitration is distributed, masters do not know that
a slave is locked. Therefore, transactions to a locked slave may still
be initiated, even though the locked slave cannot accept them.
Consequently, to prevent deadlock, these other transactions must be
either dropped, or stored such that transactions in the atomic chain can
be filtered and still be served. Moreover, the time a module is locked
is much longer in case of NoCs, because of the higher latency per
transaction.

Deadlock. In buses, deadlocks are not generally an issue. Deadlock can
still occur at the application level (e.g., an atomic chain of
transactions that locks the bus, which is never finished), but it is not
caused by the interconnect itself.

In a network, deadlock becomes a more important issue, and special care
has to be taken in the network design to avoid deadlock. Deadlock is
mainly caused by cycles in the buffers. To avoid deadlock, either
network nodes must drop packets when their buffers are filled, or
routing must be cycle-free. In a NoC, we believe the latter is
preferable, because of its lower cost in achieving reliable
communication (see Section II).

A second cause of deadlock is atomic chains of transactions. The reason
is that while a module is locked, the queues storing transactions may
get filled with transactions outside the atomic transaction chain,
blocking the transactions in the chain from reaching the locked module.
If atomic transaction chains must be implemented (to be compatible with
processors allowing this, such as MIPS), the network nodes should be
able to filter the transactions in the atomic chain, or be allowed to
drop those blocking them.

Media Arbitration. An important difference between buses and NoCs is in
the media arbitration scheme. In a bus, master modules request access to
the interconnect, and the arbiter grants the access for the whole
interconnect at once. Arbitration is centralized as there is only one
arbiter component, and global as all the requests as well as the state
of the interconnect are visible to the arbiter. Moreover, when a grant
is given, the complete path from the source to the destination is
exclusively reserved.

In a non-split bus, arbitration takes place once when a transaction is
initiated. As a result, the bus is granted for both request and
response. In a split bus, requests and responses are arbitrated
separately.

In a NoC, arbitration is also necessary, as it is a shared interconnect.
However, in contrast to buses, the arbitration is distributed, because
it is performed in every router, and is based only on local information.
Arbitration of the communication resources (links, buffers) is performed
incrementally as the request or response advances [19].

Destination Name and Routing. For a bus, the command, address, and data
are broadcast on the interconnect. They arrive at every destination, of
which one activates based on the broadcast address, and executes the
requested command. This is possible because all modules are directly
connected to the same bus.

In a NoC, it is not feasible to broadcast information to all
destinations, because it must be copied to all routers and network
interfaces. This floods the network with data. The address is better
decoded at the source to find a route to the destination module. A
transaction address will therefore have two parts: (a) a destination
identifier, and (b) an internal address at the destination.
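
The two-part address can be illustrated with a simple bit-field split
performed at the source. The field widths below (8 bits of destination
identifier, 24 bits of internal address) and the function names are
assumptions chosen for the example; the text does not prescribe an
encoding.

```python
DEST_BITS = 8     # assumed width of the destination identifier
LOCAL_BITS = 24   # assumed width of the internal address at the destination


def make_address(dest_id, local):
    """Pack a destination identifier and an internal address into one word."""
    if not 0 <= dest_id < (1 << DEST_BITS):
        raise ValueError("destination identifier out of range")
    if not 0 <= local < (1 << LOCAL_BITS):
        raise ValueError("internal address out of range")
    return (dest_id << LOCAL_BITS) | local


def split_address(address):
    """Decode at the source: return (destination identifier, internal address)."""
    dest_id = address >> LOCAL_BITS
    local = address & ((1 << LOCAL_BITS) - 1)
    return dest_id, local
```

Only the destination identifier is needed to select a route; the
internal address travels with the transaction and is interpreted at the
destination.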

Latency. Transaction latency is caused by two factors: (a) the access
time to the bus, which is the time until the bus is granted, and (b) the
latency introduced by the interconnect to transfer the data.

For a bus, where the arbitration is centralized, the access time is
proportional to the number of masters connected to the bus. The transfer
latency itself typically is constant and relatively fast, because the
modules are linked directly. However, the speed of transfer is limited
by the bus speed, which is relatively slow for buses.

In a NoC, arbitration is performed at each router for the following
link. The access time per router is small. Both end-to-end access time
and transport time increase proportionally to the number of hops between
master and slave. However, network links are unidirectional and point to
point, and hence can run at higher frequencies than buses, thus lowering
the latency.

From a latency perspective, using a bus or a network is a trade-off
between the number of modules connected to the interconnect (which
affects access time), the speed of the interconnect, and the network
diameter.

Data Format. In most modern bus interfaces the data format is defined by
separate wire groups for the transaction type, address, write data, read
data, and return acknowledgments/errors (e.g., VCI, OCP, AMBA, or
CoreConnect). This is used to pipeline transactions. For example,
concurrently with sending the address of a read transaction, the data of
a previous write transaction can be sent, and the data from an even
earlier read transaction can be received. Moreover, having dedicated
wire groups simplifies the transaction decoding; there is no need for a
mechanism to select between different kinds of data sent over a common
set of wires.

Inside a network, there is typically no distinction between different
kinds of data. Data is treated uniformly, and passed from one router to
another. This is done to minimize the control overhead and buffering in
routers. If separate wires were used for each of the above-mentioned
groups, separate routing, scheduling, and queuing would be needed,
increasing the cost of routers.

In addition, in a network at each layer in the protocol stack, control
information must be supplied together with the data (e.g., command type,
address, or data size). This control information is organized as an
envelope around the data. That is, first a header is sent, followed by
the actual data (payload), followed possibly by a trailer. Multiple such
envelopes may be provided for the same data, each carrying the
corresponding control information for each layer in the network protocol
stack [6, 24].
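
The nesting of envelopes can be sketched as repeated wrapping of the
same payload, one header and optional trailer per layer. The function
name and the byte strings used as headers and trailers are made up for
illustration; real header formats are not given in the text.

```python
def encapsulate(payload, layers):
    """Wrap payload in one envelope per layer.

    layers is a list of (header, trailer) byte strings, ordered from the
    innermost (highest) layer to the outermost (lowest) layer.
    """
    data = payload
    for header, trailer in layers:
        # Each layer treats everything it receives as opaque payload.
        data = header + data + trailer
    return data
```

For example, encapsulate(b"DATA", [(b"T:", b""), (b"N:", b";")]) puts a
hypothetical transport header "T:" directly around the data and a
hypothetical network header "N:" and trailer ";" around the result.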

Buffering and Flow Control. Buffering data of a master (output
buffering) is used both for buses and NoCs to decouple computation from
communication. However, for NoCs output buffering is also needed to
marshal data, which consists of (a) (optionally) splitting the outgoing
data in smaller packets which are transported by the network, and (b)
adding control information for the network around the data (packet
header). To avoid output buffer overflow the master must not initiate
transactions that generate more data than the currently available space.

Similarly to output buffering, input buffering is also used to decouple
computation from communication. In a NoC, input buffering is also
required to unmarshal data.

In addition, flow control for input buffers differs for buses and NoCs.
For buses, the source and destination are directly linked, and the
destination can therefore signal directly to a source that it cannot
accept data. This information can even be available to the arbiter, such
that the bus is not granted to a transaction trying to write to a full
buffer.

In a NoC, however, the destination of a transaction cannot signal
directly to a source that its input buffer is full. Consequently,
transactions to a destination can be started, possibly from multiple
sources, after the destination's input buffer has filled up. Two
policies can be adopted when an input buffer is full. The first is not
to accept additional incoming transactions, and to store them in the
network. However, this approach can easily lead to network congestion,
as the data could be eventually stored all the way to the sources,
blocking the links in between. The second approach is to accept incoming
transactions at a full destination, and drop some data in the input
buffer. Congestion is avoided but data is lost, which is undesirable.

To avoid input buffer overflow, connections can be used, together with
end-to-end flow control. At connection set up between a master and one
or more slaves, buffer space is allocated at the network interfaces of
the slaves, and the network interface of the master is assigned credits
reflecting the amount of buffer space at the slaves. The master can only
send data when it has enough credits for the destination slave(s). The
slaves grant credits to the master when they consume data.
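
A minimal sketch of this credit scheme for a single master-slave pair is
given below. The class and method names are hypothetical, credits are
counted in words, and the delay with which credits travel back through
the network is ignored; these are simplifications, not part of the
description above.

```python
from collections import deque


class CreditedConnection:
    """Toy model of end-to-end flow control between one master and one slave."""

    def __init__(self, slave_buffer_words):
        # At connection set up the master receives one credit per word of
        # buffer space allocated at the slave's network interface.
        self.credits = slave_buffer_words
        self.slave_buffer = deque()

    def send(self, word):
        """Master side: send one word, or return False to stall."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.slave_buffer.append(word)
        return True

    def consume(self):
        """Slave side: consume one word and grant one credit to the master."""
        word = self.slave_buffer.popleft()
        self.credits += 1
        return word
```

Because the master never holds more credits than there is free space at
the slave, the slave's input buffer cannot overflow and no data needs to
be stored in the network or dropped.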



IV. The Æthereal Approach

As described in the previous two sections, NoCs have different
properties from both existing off-chip networks and existing on-chip
interconnects. As a result, existing protocols and service interfaces
cannot be adopted directly to NoCs, but must take the characteristics of
NoCs into account. For example, a protocol such as TCP/IP assumes the
network is lossy, and includes significant complexity to provide
reliable communication. Therefore, it is not suitable in a NoC where we
assume data transfer reliability is already solved at a lower level. On
the other hand, existing on-chip protocols such as VCI, OCP, AMBA, or
CoreConnect are also not directly applicable. For example, they assume
ordered transport of data: if two requests are initiated from the same
master they will arrive in the same order at the destination. This does
not hold automatically for NoCs. Atomic chains of transactions and
end-to-end flow control also need special attention in a NoC interface.

Our objectives when defining our network services are the following.
First, the services abstract from the network internals as much as
possible. This is a key ingredient in tackling the challenge of
decoupling the computation from communication [14, 22], which allows IPs
(the computation part), and the interconnect (the communication part) to
be designed independently from each other. As a consequence, our
services are positioned at the transport layer in the ISO-OSI reference
model [24], which is the first layer to be independent of the
implementation of the network.

Second, we aim at a NoC interface as close as possible to a bus
interface. NoCs can then be introduced non-disruptively: with minor
changes, existing IPs, methodologies and tools can continue to be used.
As a consequence, we use a request-response interface, similar to
interfaces for split buses [1, 12, 13, 17, 25].

Third, our interface extends traditional bus interfaces to fully exploit
the power of NoCs. For example, we offer connection-based communication
which does not only relax ordering constraints (as for buses), but also
enables new communication properties, such as end-to-end flow control
based on credits, or guaranteed throughput [8, 19, 20]. All these
properties can be set for each connection individually.



A. The Æthereal Connection and Transaction Model

IPs interact with our network [8, 19, 20] at so-called network
interfaces (NI). NIs provide NI ports (NIP) through which the
communication services are accessed. As shown in Figure 3, a NI can have
several NIPs to which one or more IPs (computation elements or memories,
but not interconnection elements) can be connected. Similarly, an IP can
be connected to more than one NI and NIP.

Figure 3. Examples of links between NIs and IPs.

Communication between NIPs is performed on connections. Connections are
introduced to describe and identify communication with different
properties, such as guaranteed throughput, bounded latency and jitter,
ordered delivery, or flow control. For example, to distinguish and
independently guarantee communication of 1 Mb/s and 25 Mb/s, two
connections can be used. Two NIPs can be connected by multiple
connections, possibly with different properties. Connections as defined
here are similar to the concept of threads and connections from OCP and
VCI. Where in OCP and VCI connections are used only to relax transaction
ordering, we generalize from only the ordering property to include
configuration of buffering and flow control, guaranteed throughput, and
bounded latency per connection.

Æthereal connections must be created with the desired properties before
being used. This may result in resource reservations inside the network
(e.g., buffer space, or percentage of the link usage per time unit). If
the requested resources are not available, the network will refuse the
request. After usage, connections are closed, which leads to freeing the
resources occupied by that connection.
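
The create, refuse, and close behaviour can be illustrated with a toy
resource manager. Everything named here (ToyNetwork, open_connection,
close_connection, and the two resource kinds tracked as plain counters)
is an assumption made for the example and does not describe the actual
Æthereal configuration interface.

```python
class ToyNetwork:
    """Hypothetical bookkeeping of reservable network resources."""

    def __init__(self, link_slots, buffer_words):
        self.free_slots = link_slots      # share of link usage per time unit
        self.free_buffer = buffer_words   # buffer space in words
        self.connections = {}             # connection id -> (slots, words)
        self._next_id = 0

    def open_connection(self, slots, buffer_words):
        """Reserve resources; return a connection id, or None if refused."""
        if slots > self.free_slots or buffer_words > self.free_buffer:
            return None
        self.free_slots -= slots
        self.free_buffer -= buffer_words
        conn_id = self._next_id
        self._next_id += 1
        self.connections[conn_id] = (slots, buffer_words)
        return conn_id

    def close_connection(self, conn_id):
        """Free the resources occupied by the connection."""
        slots, buffer_words = self.connections.pop(conn_id)
        self.free_slots += slots
        self.free_buffer += buffer_words
```

A request that exceeds the remaining resources is refused as a whole, so
an accepted connection never has to share what it reserved.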

To allow more flexibility in configuring connections, and, hence, better
resource allocation per connection, the outgoing and return parts of
connections are configured separately. For example, different buffer
space can be allocated in the ANIP and PNIPs, respectively, or different
bandwidths can be reserved for requests and responses.

Depending on the requested services, the time to handle a connection
(i.e., creating, closing, modifying services) can be short (e.g.,
creating/closing an unordered, lossy, best-effort connection) or
significant (e.g., creating/closing a multicast guaranteed-throughput
connection). Consequently, connections are assumed to be created,
closed, or modified infrequently, coinciding e.g. with reconfiguration
points, when the application requirements change.

Communication takes place on connections using transactions, consisting
of a request and possibly a response. The request encodes an operation
(e.g., read, write, flush, test and set, nop) and possibly carries
outgoing data (e.g., for write commands). The response returns data as a
result of a command (e.g., read) and/or an acknowledgment.

Figure 4. Transaction composition.

Connections involve at least two NIPs. Transactions on a connection are
always started at one and only one of the NIPs, called the connection's
active NIP (ANIP). All the other NIPs of the connection are called
passive NIPs (PNIP).

There can be multiple transactions active on a connection at a time (as
for split buses). That is, transactions can be started at the ANIP of a
connection while responses for earlier transactions are pending. If a
connection has multiple slaves, multiple transactions can be initiated
towards different slaves. Transactions are also pipelined between a
single pair of a master and a slave for both requests and responses. In
principle, transactions can also be pipelined within a slave, if the
slave allows this.

A transaction is composed of the following messages (see Figure 4):

• A command message (CMD) is sent by the ANIP, and describes the
action to be executed at the slave connected to the PNIP. Examples
of commands are read, write, test and set, and flush. Commands
are the only messages that are compulsory in a transaction. For
NIPs that allow only a single command with no parameters (e.g.,
fixed-size address-less write), we assume the command message
still exists, even if it is implicit (i.e., not explicitly sent by the IP).

• An out data message (OUTDATA) is sent by the ANIP following a
command that requires data to be executed (e.g., write, multicast,
and test-and-set).

• A return data message (RETDATA) is sent by a PNIP as a
consequence of a transaction execution that produces data (e.g., read,
and test-and-set).

• A completion acknowledgment message (RETSTAT) is an optional
message which is returned by the PNIP when a command has been
completed. It may signal either a successful completion or an
error. For transactions including both RETDATA and RETSTAT the
two messages can be combined in a single message for efficiency.
However, conceptually, they exist both: RETSTAT to signal the
presence of data or an error, and RETDATA to carry the data. In
bus-based interfaces RETDATA and RETSTAT typically exist as two
separate signals [1, 12, 13, 17, 25].

Figure 5. Transaction examples.

Messages composing a transaction are divided in outgoing messages,
namely CMD and OUTDATA, and response messages, namely RETDATA and
RETSTAT. Within a transaction, CMD precedes all other messages, and
RETDATA precedes RETSTAT if present. These rules apply both between
master and ANIP, and PNIP and slave. Examples of transactions are shown
in Figure 5.
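
The ordering rules for the messages of one transaction can be captured
in a small checker. The function name is hypothetical, and the sketch
adds one assumption that the text does not state explicitly: each
message kind occurs at most once per transaction.

```python
MESSAGE_KINDS = {"CMD", "OUTDATA", "RETDATA", "RETSTAT"}


def is_valid_transaction(messages):
    """Check the message order of a single transaction.

    messages is the list of message kinds in the order they are observed.
    """
    # CMD is compulsory and precedes all other messages.
    if not messages or messages[0] != "CMD":
        return False
    # Assumption: only known kinds, each appearing at most once.
    if any(m not in MESSAGE_KINDS for m in messages):
        return False
    if len(set(messages)) != len(messages):
        return False
    # RETDATA precedes RETSTAT if both are present.
    if "RETDATA" in messages and "RETSTAT" in messages:
        return messages.index("RETDATA") < messages.index("RETSTAT")
    return True
```

An acknowledged write would then be ["CMD", "OUTDATA", "RETSTAT"], and a
read with an explicit status would be ["CMD", "RETDATA", "RETSTAT"].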

We classify connections as follows (see Figure 6):

• A simple connection is a connection between one ANIP and one
PNIP.

• A narrowcast connection is a connection between one ANIP and
one or more PNIPs, in which the ANIP initiates transactions that
are executed by exactly one PNIP. An example of the narrowcast
connection is shown in Figure 7, where the ANIP performs
transactions on an address space which is mapped on two memory
modules. Depending on the transaction address, a transaction
is executed on only one of these two memories.

• A multicast connection is a connection between one ANIP and
one or more PNIPs, in which the sent messages are duplicated and
each PNIP receives a copy of those messages. In a multicast
connection no return messages are currently allowed, because of the
large traffic they generate (i.e., one response per destination). It
could also increase the complexity in the ANIP because individual
responses from PNIPs must be merged into a single response for
the ANIP. This requires buffer space and/or additional computation
for the merging itself.

Figure 7. A narrowcast connection.
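
The narrowcast example of Figure 7, one address space mapped on two
memories, can be sketched as an address-range lookup in the ANIP. The
address ranges, the PNIP labels, and the function name are invented for
the illustration.

```python
# Hypothetical address map of a narrowcast connection: (base, size, PNIP).
MEMORY_MAP = [
    (0x0000, 0x1000, "PNIP0"),   # first memory module
    (0x1000, 0x1000, "PNIP1"),   # second memory module
]


def select_pnip(address, memory_map=MEMORY_MAP):
    """Return the single PNIP that executes a transaction and its local offset."""
    for base, size, pnip in memory_map:
        if base <= address < base + size:
            return pnip, address - base
    raise ValueError("address is not mapped on this connection")
```

Every transaction address selects exactly one PNIP, which distinguishes
narrowcast from multicast, where each message would be copied to all
PNIPs of the connection.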



B. Connection Properties 

In this section we describe the properties that can be configured for
a connection: guaranteed message integrity, guaranteed transaction
completion, various transaction orderings, guaranteed throughput,
bounded latency and jitter, and connection flow control.

Data Integrity. Data integrity means that the payload of the message is
not changed (accidentally or not) during transport. We assume that data
integrity is already solved at a lower layer in our network, namely at
the link layer, because in current on-chip technologies data can be
transported uncorrupted over links. Consequently, our network interface
always guarantees that messages are delivered uncorrupted at the
destination.

Transaction Completion. A transaction without a response is said to be
complete when it has been executed by the slave. As there is no response
message to the master, no guarantee regarding transaction completion can
be given.

Figure 8. Message ordering is observable at a, b, c, and d.

A transaction with a response is said to be complete when a RETSTAT
message is received from the ANIP (footnote 1). The transaction may
either (a) be executed successfully, in which case a success RETSTAT is
returned, (b) fail in its execution at the slave, and then an execution
error RETSTAT is returned, or (c) fail because of buffer overflow in a
connection with no flow control, and then it reports an overflow error.

In our network, routers do not drop data [20], therefore, messages are
always guaranteed to be delivered at the NI. For connections with flow
control, also NIs do not drop data. Thus, message delivery to the IPs is
guaranteed automatically in this case.

However, if there is no flow control, messages may be dropped at the
network interface in case of buffer overflow (see the paragraph on
end-to-end flow control below). All of CMD, OUTDATA, and RETDATA may be
dropped at the NI. To guarantee transaction completion, RETSTAT is not
allowed to be dropped. Consequently, in the ANIPs enough buffer space
must be provided to accommodate RETSTAT messages for all outstanding
transactions. This is enforced by bounding the number of outstanding
transactions.

Transaction Ordering. In this section, we describe the ordering
requirements between different transactions within a single connection.
Over different connections no ordering of transactions is defined at the
transport layer.



1. We assume that when data is received as a response (RETDATA), a
RETSTAT (possibly implicit) is also received to validate the data.

There are several points in a connection where the order of transactions
can be observed (see Figure 8): (a) the order in which the master
presents CMD messages to the ANIP, (b) the order in which the CMDs are
delivered to the slave by the PNIP, (c) the order in which the slave
presents the responses to the PNIP, and (d) the order in which the
responses are delivered to the master by the ANIP. Note that not all of
(b), (c), and (d) are always present. Moreover, there are no assumptions
about the order in which the slaves execute transactions; we can only
observe the order of the responses. We consider the order of the
transaction execution to be a system decision, and not a part of the
interconnect protocol.

At both ANIP and PNIPs, outgoing messages belonging to different
transactions on the same connection are allowed to be interleaved. For
example, two write commands can be issued, and only afterwards their
data follows. If the order of OUTDATA messages differs from the order
of CMD messages, transaction identifiers must be introduced to associate
OUTDATAs with their corresponding CMD.

Outgoing messages can be delivered by the PNIPs to the slaves (see
Figure 8-b) as follows:

• Unordered, which imposes no order on the delivery of the outgoing
messages of different transactions at the PNIPs.

• Ordered locally, where transactions must be delivered to each
PNIP in the order they were sent, but no order is imposed across
PNIPs. Locally-ordered delivery of the outgoing messages can be
provided either by an ordered data transportation, or by reordering
of outgoing messages at the PNIP.

• Ordered globally, where transactions must be delivered in the
order they were sent, across all PNIPs of the connection.
Globally-ordered delivery of the outgoing part of transactions
requires a costly synchronization mechanism.

Transaction response messages can be delivered by the slaves to the
PNIPs (see Figure 8-c) as follows:

• Ordered, when RETDATA and RETSTAT messages are returned in
the same order as the CMDs were delivered to the slave.

• Unordered, otherwise.

When responses are unordered, there has to be a mechanism to identify
the transaction to which a response belongs. This is usually done using
tags attached to messages for transaction identification (similar to
tags in VCI).
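
Tag-based identification can be sketched as a small table that hands out
a tag when a transaction is issued and looks the transaction up again
when a tagged response returns. The class name, the fixed tag pool, and
the choice to stall when no tag is free are assumptions of the sketch.

```python
class TagTable:
    """Hypothetical tag pool that matches unordered responses to transactions."""

    def __init__(self, num_tags):
        self.free = list(range(num_tags))   # tags not currently in use
        self.pending = {}                   # tag -> outstanding transaction

    def issue(self, transaction):
        """Attach a tag to a new transaction; return None if none is free."""
        if not self.free:
            return None
        tag = self.free.pop()
        self.pending[tag] = transaction
        return tag

    def complete(self, tag):
        """Look up the transaction a response belongs to and recycle its tag."""
        transaction = self.pending.pop(tag)
        self.free.append(tag)
        return transaction
```

Responses may return in any order; the tag alone is enough to find the
matching transaction, at the price of carrying the tag in every message.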

Response messages can be delivered by the ANIP to the master (see
Figure 8-d) as follows:

• Unordered, which imposes no order on the delivery of responses.
Here, also, tags must be used to associate responses with their
corresponding CMDs.

• Ordered locally, where RETDATA and RETSTAT messages of
transactions for a single slave are delivered in the order the original
CMDs were presented by the master to the ANIP. Note that there is
no ordering imposed for transactions to different slaves within the
same connection.

• Globally ordered, where all responses in a connection are
delivered to the master in the same order as the original CMDs. When
transactions are pipelined on a connection, then globally-ordered
delivery of responses requires reordering at the ANIP.

All 3 x 2 x 3 = 18 combinations between the above orderings are
possible. Out of these, we define and offer the following two. An
unordered connection is a connection in which no ordering is assumed in
any part of the transactions. As a result, the responses must be tagged
to be able to identify to which transaction they belong. Implementing
unordered connections has low cost, however, they may be harder to use,
and introduce the overhead of tagging.

An ordered connection is defined as a connection with local ordering
for the outgoing messages from PNIPs to slaves (Figure 8-b), ordered
responses at the PNIPs (Figure 8-c), and global ordering for responses
at the ANIP (Figure 8-d). We choose local ordering for the outgoing part
because the global ordering has a too high cost, and has few uses. The
ordering of responses is selected to allow a simple programming model
with no tagging. Global ordering at the ANIP is possible at a moderate
cost, because all the ordering is done locally in the ANIP.
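
The space of ordering combinations, and the two that are offered, can be
written out explicitly. The string labels and variable names below are
only illustrative names for the options listed above.

```python
from itertools import product

# Ordering options at the three observation points (Figure 8-b, 8-c, 8-d).
OUTGOING_AT_PNIP = ["unordered", "ordered locally", "ordered globally"]
RESPONSES_AT_PNIP = ["ordered", "unordered"]
RESPONSES_AT_ANIP = ["unordered", "ordered locally", "ordered globally"]

# Every possible combination of the three choices.
COMBINATIONS = list(
    product(OUTGOING_AT_PNIP, RESPONSES_AT_PNIP, RESPONSES_AT_ANIP)
)

# The two connection types that are defined and offered.
UNORDERED_CONNECTION = ("unordered", "unordered", "unordered")
ORDERED_CONNECTION = ("ordered locally", "ordered", "ordered globally")
```

Both offered connection types are members of COMBINATIONS, whose length
is the 18 stated above.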

A user can emulate connections with global ordering at the PNIPs using
non-pipelined acknowledged transactions.

Connection latency, throughput, and jitter. In our network, throughput
can be reserved for connections in a time-division multiple access
(TDMA) fashion, where bandwidth is split in fixed-size slots on a fixed
time frame. Bandwidth, as well as bounds on latency and jitter can be
guaranteed when slots are reserved. They are all defined in multiples of
the slots.
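
Slot reservation on a fixed time frame can be sketched with a simple
slot table for one link. The class name, the first-fit choice of slots,
and the method names are assumptions; the actual slot allocation in the
network is not described here.

```python
class SlotTable:
    """Hypothetical TDMA slot table of a single link."""

    def __init__(self, num_slots):
        # One entry per slot of the fixed time frame; None means unreserved.
        self.owner = [None] * num_slots

    def reserve(self, connection, count):
        """Reserve count free slots (first fit); return them, or None."""
        free = [i for i, o in enumerate(self.owner) if o is None]
        if count > len(free):
            return None
        chosen = free[:count]
        for i in chosen:
            self.owner[i] = connection
        return chosen

    def bandwidth_fraction(self, connection):
        """Guaranteed share of the link bandwidth, in multiples of one slot."""
        return self.owner.count(connection) / len(self.owner)

    def owner_at(self, cycle):
        """Connection that owns the link in the given slot cycle, if any."""
        return self.owner[cycle % len(self.owner)]
```

With a frame of 8 slots, reserving 2 slots guarantees a quarter of the
link bandwidth; slots for which owner_at returns None are the left-over
capacity that best-effort traffic could use.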

Guaranteed-throughput connections can overbook resources in some cases.
For example, when an ANIP opens a guaranteed-throughput read connection,
it must reserve slots for the read command messages, and for the read
data messages. The ratio between the two can be very large (e.g.,
1:100), which leads either to a large number of slots, or bandwidth
being wasted for the read command messages.

To solve this problem, we allow the request and response parts of a
connection to be configured independently for all of throughput, latency
and jitter. Consequently, the request part of a connection can be best
effort, while the response can have guaranteed throughput (or vice
versa). For the example mentioned above, we can use best-effort read
messages, and guaranteed-throughput read-data messages. No global
connection guarantees can be offered in this case, but the overall
throughput can be higher and more stable than in the case of using only
best-effort traffic.
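
The overbooking effect can be made concrete with a small calculation. The sketch below assumes that requests and responses are reserved in whole slots of equal size and that the request-to-response bandwidth ratio is known; the function name and these assumptions are ours, not part of the network definition.

```python
import math
from fractions import Fraction


def request_slots(response_slots, ratio):
    """Slots needed for requests, given the slots reserved for responses.

    `ratio` is the request:response bandwidth ratio, e.g. Fraction(1, 100).
    Returns (slots to reserve, fraction of a slot that stays unused).
    """
    needed = response_slots * ratio
    reserved = math.ceil(needed)  # slots can only be reserved as a whole
    return reserved, reserved - needed
```

With a 1:100 ratio, 10 response slots still force one full request slot of which nine tenths is wasted, whereas 100 response slots use the request slot fully but require a very large slot table. Making the request part best effort avoids both.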



Connection flow control. As mentioned earlier, our network
guarantees that messages are delivered to the NI. Messages sent from one
of the NIPs are not immediately visible at the other NIP, because of the
multi-hop nature of networks. Consequently, handshakes over a network
would allow only a single message to be transmitted at a time. This limits
the throughput on a connection and adds latency to transactions. To solve
this problem, and achieve a better network utilization, the messages must
be pipelined. In this case, if the data is not consumed at the PNIP at the
same rate it arrives, either flow control must be introduced to slow down
the producer, or data may be lost because of limited buffer space at the
consumer NI.

We introduce end-to-end flow control at the level of connections, which
requires buffer space to be associated with connections. End-to-end flow
control ensures that messages are sent over the network only when there is
enough space in the NIP's destination buffer to accommodate them.

End-to-end flow control is optional (i.e., to be requested when the
connection is opened) and can be configured independently for the
outgoing and return paths. When no flow control is provided, messages are
dropped when buffers overflow. Multiple policies of dropping messages are
possible, as in off-chip networks. Possible scenarios include: (a) the
oldest message is dropped (milk policy), or (b) the newest message is
dropped (wine policy) [24].
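
The two dropping policies can be sketched as a bounded buffer. The class and attribute names below are hypothetical and serve only to show the difference between the milk and wine policies when no flow control is used.

```python
from collections import deque


class DroppingBuffer:
    """Bounded message buffer without flow control (illustrative names)."""

    def __init__(self, capacity, policy):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if policy not in ("milk", "wine"):
            raise ValueError("policy must be 'milk' or 'wine'")
        self.capacity = capacity
        self.policy = policy
        self.messages = deque()
        self.dropped = []

    def push(self, message):
        if len(self.messages) < self.capacity:
            self.messages.append(message)
        elif self.policy == "milk":
            # Milk policy: discard the oldest message to make room.
            self.dropped.append(self.messages.popleft())
            self.messages.append(message)
        else:
            # Wine policy: discard the newly arriving message.
            self.dropped.append(message)
```

Pushing messages 1, 2, 3 into a buffer of capacity 2 leaves messages 2 and 3 under the milk policy and messages 1 and 2 under the wine policy.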

We opt for a credit-based flow control. Credits are associated with the
empty buffer space at the receiver NI. The sender's credit is lowered as
data is sent. When data is delivered at the receiver NIP, credits are
granted to the sender. If the sender's credit is not sufficient to send
some data, the NI at the sender stalls the sending.
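
A minimal sketch of this credit mechanism follows. It assumes that credits are counted in words and that "delivered at the receiver NIP" means the data has left the receiver's buffer, so that its space is free again; the class and method names are hypothetical.

```python
class CreditSender:
    """Sender-side NI state for one connection (credits counted in words)."""

    def __init__(self, receiver_buffer_words):
        # Initially the whole receiver buffer is empty, so all of it is credit.
        self.credits = receiver_buffer_words

    def try_send(self, words):
        """Send if enough credit is available; otherwise stall (return False)."""
        if words > self.credits:
            return False
        self.credits -= words
        return True

    def grant(self, words):
        """Add credits returned by the receiver after it freed buffer space."""
        self.credits += words


class CreditReceiver:
    """Receiver-side NI buffer for the same connection."""

    def __init__(self, buffer_words):
        self.capacity = buffer_words
        self.free = buffer_words

    def accept(self, words):
        if words > self.free:
            raise OverflowError("sender exceeded its credits")
        self.free -= words

    def deliver(self, words):
        """Hand data to the IP, free the space, and return the credits to grant."""
        if words > self.capacity - self.free:
            raise ValueError("cannot deliver more data than is buffered")
        self.free += words
        return words
```

Because the sender never transmits more than its credit, the receiver buffer cannot overflow and no message is dropped, at the price of stalling the producer.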



C. Use Cases

To illustrate the need for differentiated services on connections, we
show in this section some examples of traffic. We describe the properties
they would use over an Æthereal connection to meet their traffic
requirements.

Video processing streams typically require a lossless, in-order video
stream with guaranteed throughput, but possibly allow corrupted samples.
An Æthereal connection for such a stream would require the necessary
throughput, ordered transactions, and flow control. If the video stream is
produced by the master, only write transactions are necessary. In such a
case, with a flow-controlled connection there is no need to also require
transaction completion, because messages are never dropped, and the write
command and its data are always delivered at the destination. Data
integrity is always provided by our network, even though it may not be
necessary in this case.

Another example is that of cache updates, which require uncorrupted,
lossless, low-latency data transfers, but for which ordering and
guaranteed throughput are less important. In such a case, a connection
would not require any time-related guarantees, because a low latency,
even if preferable, is not critical. Low latency can be obtained even with
a best-effort connection. The connection would also require flow control
and guaranteed transaction completion to ensure lossless transactions.
However, no ordering is necessary, because this is not important for cache
updates, and allowing out-of-order transactions can reduce the response
time.
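
The two use cases can be summarized as connection configurations. The record below is a hypothetical encoding of the properties discussed in this section; the field names are ours and do not correspond to an actual Æthereal programming interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionProperties:
    """Hypothetical record of the properties requested for one connection."""
    guaranteed_throughput: bool
    ordered: bool
    flow_control: bool
    transaction_completion: bool
    data_integrity: bool = True  # always provided by the network


# Video stream: throughput, ordering, and flow control; no completion needed.
VIDEO_STREAM = ConnectionProperties(
    guaranteed_throughput=True, ordered=True,
    flow_control=True, transaction_completion=False)

# Cache updates: best effort and unordered, but lossless with completion.
CACHE_UPDATE = ConnectionProperties(
    guaranteed_throughput=False, ordered=False,
    flow_control=True, transaction_completion=True)


def is_lossless(props):
    """With flow control, messages are never dropped on the connection."""
    return props.flow_control
```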



V. Conclusions

In this paper, we compare networks on chip (NoC) to off-chip networks
(e.g., computer networks) and existing on-chip interconnects (e.g.,
busses). We show that NoCs have many similarities with off-chip networks.
However, they also differ, especially in their resource constraints. For
example, on a chip, memory and computation resources are more expensive,
while there are more wires. This makes NoC architectures different from
off-chip networks, and requires rethinking of network services.

We also compare NoCs to existing on-chip interconnects, such as buses
and switches. By directly connecting IP blocks, existing on-chip
interconnects can offer tight coupling between masters and slaves, and
global arbitration. In NoCs, masters and slaves are completely decoupled,
and the arbitration is distributed over the network nodes. This makes it
harder to provide guarantees, such as bandwidth lower bounds, and
transaction orderings.

We define a set of NoC services that abstract from the network details.
Using these services in the IP design decouples computation and
communication. We use a request-response transaction model to be close to
existing on-chip interconnect protocols. This eases the migration of
current IPs to NoCs. To fully utilize the NoC capabilities, such as high
bandwidth and transaction concurrency, our services provide
connection-oriented communication. Connections can be configured
independently with different properties. These properties include
transaction completion, various transaction orderings, bandwidth lower
bounds, latency and jitter upper bounds, and flow control.

Our services are a prerequisite for service-based system design, which
makes applications independent of NoC implementations, makes designs
more robust, and enables architecture-independent quality-of-service
strategies.



Abstract

Continuing VLSI technology scaling raises several deep
submicron (DSM) problems like relatively slow interconnect,
power dissipation and distribution, and signal integrity.
Those problems are encountered particularly on long wires
for global interconnect. As clock frequencies increase,
scaled wires become relatively slower, and on-chip
communication will be the limiting performance factor of
future chips. We explain why efficient sharing of the wires
for long-distance communication is the solution to this
problem. We introduce networks on silicon (NoS), which route
packets over shared (semi-)global wires. NoS performance
is expected to be high, but comes at a cost. Balancing the
performance and cost of a NoS is a major challenge, and
we believe busses still have a role to play.



1 Technology trend 

VLSI technology scaling has long followed Moore's law.
No fundamental barriers have been identified that invalidate
this law for at least another decade [12]. Moore's law
predicts that chips in 2010 will count over 4 billion
transistors, operating in the multi-GHz range. This abundance of
transistors will make very complex systems on silicon (SoS)
possible.

However, challenges at all abstraction levels of design
will have to be addressed before such SoSs will become a
reality. The three most important deep submicron (DSM)
challenges, related to all abstraction levels, are: substantial
wire delay, controlling power delivery and dissipation, and
signal integrity.


Until recently, on-chip wiring was cheap. Consequently 
architectural models have been employed that relied on low- 
latency communication to globally share expensive compu- 
tational resources. Global wire delay stays at best constant 




relatively slower compared to a gate delay. For example,
for 130 nm technology the reachable distance of a repeated
global signal in a clock cycle is no more than the length of a


Figure 1. The number of 50k blocks for future
process technologies.



chip [4]. For 50 nm technology, crossing a chip with highly
optimized interconnect takes between six and ten clock
cycles, clearly invalidating the low-latency assumption of
today. Hence we must move to system-level architectures
that scale with technology.

A feasible template for a future-proof architecture is
constructed from processing nodes that do not grow in
complexity with technology. Instead, as technology scales, the
number of these processing nodes on the chip grows. An
on-chip communication network then combines these nodes
into a SoS [4].

Various publications show that the spanning wires in
blocks of 50k gates scale with technology [4, 13]. This
means that the aforementioned DSM issues can be handled
by CAD tools, assuming their evolutionary improvement.
Figure 1 shows the exponentially increasing number of such
50k blocks for a large die in subsequent technologies; in
35 nm this number is approximately ten thousand (adapted
from [13] and [4]). It remains to find a communication
architecture that allows a SoS composed of these blocks to
cooperate efficiently.

2 Networks on silicon are inevitable 

Given the growing demand for and impact of interconnect
on system cost and performance, it is worthwhile to
optimize the utilization of wires. Ad-hoc global wiring
structures often lead to a huge number of wires with an
average usage as low as 10% in time [2]. To control cost in
this scenario, the wire packing density must be very high,
which is not beneficial for the power and delay
characteristics. Efficient mechanisms for sharing (semi-)global
wires must solve this cost-performance dilemma.

In deep submicron technologies, (semi-)global wires
need special attention for power, signal-integrity, and
performance reasons. In the discussion below we show how
special circuit techniques can handle these issues. Such
techniques only work, however, when embedded in dedicated
communication IP, which provides a more abstract
interface.

Power is an issue for global interconnect because it costs
more energy to send a bit of information over longer
wires. To reduce the communication delay, the energy
consumption increases due to bigger drivers. Employing
low-swing signaling for the global wires saves up to a factor of
four in power for those wires [15]. Implementing low-swing
signaling requires special circuit techniques.

Signal integrity is hampered increasingly by growing
capacitive and inductive coupling between wires. Capacitive
noise coupling is the result of the large aspect ratio of wires
in DSM technologies. Inductive noise coupling becomes
more of a problem due to the decreasing transition times. IR
drop¹ in the supply distribution increasingly contributes to
the noise. The most effective way to make a connection
robust against noise is application of differential signaling [7].
Differential signaling improves both the generation of and
sensitivity to noise.

The signal propagation delay of an uninterrupted wire
grows quadratically with its length; hence from a certain
length onwards it is advantageous to partition the wire in
segments with repeaters in between. The repeater insertion
technique improves bandwidth and latency, but at the cost of
higher power consumption. Wire delay can be reduced by
fat wires with a lower resistance per unit length, at the cost
of lower wire density. Such wires behave like lossy
transmission lines and require drivers with a resistance matched
to the transmission line.
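
The benefit of repeater insertion follows directly from the quadratic delay. The sketch below uses a simple first-order model that is our own assumption: an unrepeated wire of length L has delay k·L², and each repeater adds a fixed delay. The constants and function names are illustrative, not measured values.

```python
import math


def unrepeated_delay(length, k):
    """Delay of an uninterrupted wire: grows quadratically with its length."""
    return k * length ** 2


def repeated_delay(length, k, segments, repeater_delay):
    """Delay of the same wire split into equal segments, one repeater each."""
    segment = length / segments
    return segments * (k * segment ** 2 + repeater_delay)


def best_segment_count(length, k, repeater_delay):
    """Segment count minimizing k*L^2/n + n*t_rep over the integers n >= 1."""
    ideal = length * math.sqrt(k / repeater_delay)
    candidates = {max(1, math.floor(ideal)), max(1, math.ceil(ideal))}
    return min(candidates,
               key=lambda n: repeated_delay(length, k, n, repeater_delay))
```

In this model a wire of length 10 with k = 1 and a repeater delay of 4 takes 100 time units unrepeated but only 40 when split into 5 segments, and the repeated delay grows linearly with length. A wire of length 1 is best left unsegmented, which matches the remark that repeaters pay off only from a certain length onwards.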

As a result, we believe that all inter-block communication
will be implemented by hard-macro transmitters and
receivers, employing low-swing differential signaling, with
well-controlled interconnect instead of ad-hoc drivers
handled by standard place-and-route tools. In this way,
communication links can be realized with predictable performance
and DSM robustness.

Currently, the prevalent on-chip interconnects are
busses [1]. In a bus architecture, devices share a single
transmission medium to communicate. At a given time,

¹ Supply voltage drops are caused by high currents (I) flowing through
the resistance (R) of the supply network. Under scaling, IR drop worsens.

Figure 2. Structural view of a network on silicon
consisting of processing nodes (P) and
nodes supporting communication (R, B).



only one device has access to the shared medium. An
arbitration mechanism is required to order simultaneous
accesses. Such functionality is typically performed by a
centralized bus arbiter. The performance of a shared-medium
bus scales badly. For an increasing number of bus clients
(i) individual clients get less bandwidth on average, and (ii)
increased capacitive loads and wire length decrease the total
bandwidth.

A solution that pairs scalable communication performance
and minimal interconnect cost is expected from networks
on silicon (NoS), where the SoS is considered as a
network of components [2, 3, 1]. Figure 2 illustrates the
hardware architecture of this concept. The outer components
(marked P) exclusively perform processing and storage
functions, whereas the inner components (marked B and
R) form the NoS and cater to communication needs of the
outer components. The basic building blocks of a NoS are
routers (R).

A router forwards data from its input ports to its output
ports in a concurrent fashion. To that end, a router of
arity N contains an N x N switch matrix. Data packets
make their way through the network based on the routing
information in their headers. A link between two routers is
implemented by a point-to-point connection. The links
typically span medium to long distances, ranging from several
to more than twenty millimeters. The actual length
depends on the chosen topology of the network. For a mesh
topology the links are relatively short; for a torus, which is
a mesh with wrap-around connections, some links have a
length of half the edge of the chip. Links can be optimized
for bandwidth, latency, power, or a combination of these,
depending on performance requirements.

3 NoS requirements 

An important characteristic of a future system-level
architecture is the separation between computation and
communication. A NoS allows the computational blocks to
communicate with one another via a uniform interface. A
uniform interface is advantageous because (i) it frees the
core developer from having to make assumptions about the
system in which the core will be used, and (ii) it does not
constrain the development of newer communication
architectures by detailed interfacing requirements of particular
legacy SoC components [6]. Several on-chip bus standards
are evolving to realize this goal, most notably VCI, put
forward by VSIA [14], and more recently, the Open Core
Protocol [10].

The fundamental aim of a NoS is to provide flexible and
efficient communication between the thousands of IP blocks
in a system, with performance guarantees. In a typical SoS,
the communication demands of different IP blocks show
large variations. For example, data rates may be constant
(e.g. digital video) or variable (e.g. compressed video). The
importance of latency and jitter also varies greatly. Finally,
the data granularity may range from single words to large
blocks. A NoS should be able to offer different services to
different clients. Each service class must be implemented
efficiently, using a shared uniform infrastructure.

A high utilization of the network comes at a price. When
the network starts to saturate, throughput and latency will
show huge variations, which is not acceptable in real-time
applications. Hence, the network should also provide
guarantees, like lossless data transport, minimal bandwidth,
and bounded latency. The way packets are buffered and
scheduled in routers, and the effects on performance
guarantees, have been the subject of intense research.
Fundamentally, sharing and guarantees are conflicting, and
efficiently combining guaranteed traffic with best-effort traffic
is hard [11]. Although best-effort services are cheaper than
guaranteed services, we believe that the latter are essential
because they enable compositional and scalable integration
of the IP blocks [5]. It is up to the IP integrator at design
time, and up to the application at run time, to make a trade


4 Performance and cost analysis of NoSs 



The vision of previous sections is that the design of
future SoSs will allow IP blocks to be plugged in at will to
… communication costs, but without today's problems
like timing closure. In this section we investigate the
cost implications of system design based on a NoS. We hope
the vision comes at acceptable cost: that the overall
cost of a NoS, including the full protocol stack to use it,
turns out to be acceptable, such that the integration blessings
of NoSs do not change into a cost nightmare.

4.1 Performance

The aggregate bandwidth of a router is the product of the
bandwidth per port, $BW_{port}$, the arity of the router (number
of ports), $N$, and a utilization factor, $\alpha \le 1$, corresponding
to the router arbitration scheme:

$$BW_{router} = \alpha \cdot N \cdot BW_{port} \qquad (1)$$

We discuss each in turn. The bandwidth per port is
determined by the bandwidth of the link and the router data path.
In short:

$$BW_{port} = B \times \min(BW_{wire},\ BW_{router\text{-}data\text{-}path}) \qquad (2)$$

where $B$ is the width of the data path. The combined
bandwidth of the $B$ wires of a link is a function of the layout
characteristics (e.g. total length), the chosen signaling
technique, and the budgets for power, delay, and area. A
first-order expression for the bandwidth of a repeated global wire
optimized for power-delay is

where FO4 is the delay of an inverter driving four equally
sized inverters [4]. In a 100 nm technology, this yields 5
Gb/s per wire under worst-case environmental conditions.
Notice that the bandwidth of repeated global wires scales
with technology because such wires allow (wave) pipelining
at the segments.

Running the router data path at 5 GHz is not feasible. An
aggressive but realistic frequency is 1.25 GHz, corresponding
to the clock frequency of 50k-gate blocks [4]. The critical
function in the data path is the N x N switch. For N up
to 20 it meets the 1.25 GHz data rate, using N 1-out-of-N
multiplexers. The relaxed demand on the wires of the link
can be used to reduce power dissipation and area.

The utilization factor, $\alpha$, reflects the effectiveness of
the router in resolving contention on the links. The queuing
strategy, the queue sizes, and the scheduling algorithm all
strongly influence $\alpha$. Accordingly, many queuing policies
and scheduling algorithms have been presented in the
literature. For example, $\alpha = 0.59$ for infinite FIFO input queues
with uniform and independent traffic. (Virtual) output
queuing gives $\alpha = 1$ under the same conditions, but at the
cost of larger queues and a more complex scheduling
algorithm [8]. Static scheduling techniques like
(time-division-multiplexed) circuit switching can also improve the
utilization factor.

Hence, in 100 nm technology, the bandwidth of a 32 bit
router port is approximately 5 GByte/sec.
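
The 5 GByte/sec figure can be checked with a short calculation that restates the two relations of this section: the port bandwidth is the data-path width times the slower of the wire and data-path rates, and the router bandwidth is the product of the utilization factor, the arity, and the port bandwidth. The function names and the arity of 4 in the usage note are assumptions for illustration.

```python
def port_bandwidth(width_bits, wire_rate, datapath_rate):
    """Bits per second of one port: width times the slower per-wire rate."""
    return width_bits * min(wire_rate, datapath_rate)


def router_bandwidth(alpha, arity, port_bw):
    """Aggregate router bandwidth for utilization factor 0 < alpha <= 1."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    return alpha * arity * port_bw


# 32-bit port, 5 Gb/s wires, 1.25 GHz data path (values from the text).
PORT_BITS_PER_SECOND = port_bandwidth(32, 5e9, 1.25e9)
PORT_BYTES_PER_SECOND = PORT_BITS_PER_SECOND / 8
```

The data path, not the wire, is the limiting factor: 32 x 1.25 Gb/s gives 40 Gb/s, or 5 GByte/sec per port. For a hypothetical router of arity 4 with FIFO input queues ($\alpha = 0.59$), the aggregate bandwidth would be about 94.4 Gb/s.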

4.2 Cost

Three main components contribute to the area cost of a
router: the switch, the control logic, and the packet queues.
The switch allows N simultaneous connections from the
N inputs to the N outputs, which results in B arrays of N x
N wires, giving rise to an O(N^2) area cost.

The control logic of a router is made up of the switch-matrix
schedule unit and other configuration logic. The
delay of a schedule cycle varies greatly per algorithm
(for example, for virtual output queuing from O(1) to
O(N^(5/2)) [9]); it is important for two reasons. First, it
determines the lower bound for the latency that a flit² incurs to
traverse the router. Second, it affects the size of the queues.
The longer a schedule cycle, the more data arrive, given a
fixed bandwidth of a port $BW_{port}$. This leads to deeper
queues, and higher area cost.

The three aforementioned queuing strategies require
queues of size O(N) to O(N^2) flits. Scheduling algorithms
perform better with deeper queues, with a decreasing return.

Besides routers, a significant amount of area is consumed
by so-called network interface (NI) modules. These modules
translate the IP transactions for a given connection to
packets that are sent over the network, and vice versa.
Packets can be sent once the payload has been completely
accepted by the NI. Hence, the buffers must be dimensioned
such that at least a complete packet for every simultaneously
active connection can be stored.

The trade-off between utilization $\alpha$ and the cost is a complex
one, but of importance to the viability of NoSs.

5 The future role of busses 

In Sections 1 and 2 we have argued that NoSs are essential
to solve SoS integration in a scalable fashion. While
Section 4.2 raised some general cost issues, we will now
more concretely consider the trade-off between busses
and NoSs. Will packet-switched NoSs completely replace
current busses in future SoSs, or will a hybrid approach
emerge? We believe that shared busses may have a role
to play in first-level communication (B in Figure 2) for the
following reasons.

First, typical IP blocks underutilize the bandwidth capacity
of an individual router port. All router ports offer the
same bandwidth that is inherent to the architecture, whereas
the bandwidth requirements of IP blocks vary greatly. A
shared memory module typically needs much higher (peak)
bandwidth than a streaming peripheral device. Single-word
transfers, variable bit rates, bursty I/O, and much lower clock
rates for IP blocks than for the NoS further waste bandwidth.
This means that the communication needs of a number
of IP blocks can be aggregated using a bus before the
capacity of a network link is reached.

Second, network interfaces are more expensive (in terms
of area) than a bus adaptor. Using a bus as a first-level traffic
concentrator, trading bus adaptors for network interfaces,
thus reduces the overall cost of IP-NoS interfacing. We
expect that the overhead of a bus and its network interface is
outweighed.

² Flit stands for flow control digit: the atomic portion of data handled
per schedule cycle. A packet is decomposed in flits.

Figure 3. A shared-medium bus seems a cost-effective
way to connect the IP to the packet-switched
network.

Finally, the number of routers is reduced significantly
when busses are used as the first-level interconnect. Routers
are larger than busses due to their packet queues and more
complex scheduling. We give an example below.

An example of the heterogeneous communication architecture
is depicted in Figure 3. A router of arity three
surrounded by twelve IP blocks is shown. Two shared-medium
busses, each connected to six 50k-gate IP blocks,
communicate with the router via two network interfaces. These
have two functions: first they schedule the transactions on
the bus, and second they give the bus clients access to the
packet-switched network. The third port of the router
provides communication to the remainder of the network.
Figure 4 shows an architecture using only routers. Now three
routers of arity five and one of arity four are needed.
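
To give a feel for the difference, the sketch below compares only the switch area of the two architectures, using the quadratic switch cost from Section 4.2. This is a deliberately partial estimate under our own assumptions: it ignores packet queues, control logic, network interfaces, and the busses themselves, and the function name is illustrative.

```python
def switch_cost(arities):
    """Relative switch area: N*N crosspoints per router, per data-path bit."""
    return sum(n * n for n in arities)


HYBRID = switch_cost([3])               # one router of arity three
HOMOGENEOUS = switch_cost([5, 5, 5, 4])  # three of arity five, one of four
```

In this estimate the hybrid architecture needs 9 crosspoint units against 91 for the router-only architecture, roughly a factor of ten, before the cost of the two busses and their network interfaces is added back.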

The suggested shared-medium bus has a length of 35kλ,
where λ is half of the length of a minimal transistor. Global
wires of this length will not be the bottleneck of bus
performance.³

The feasibility of hybrid NoSs hinges on the right
implementation of the busses. First, they must be shared wires,
as opposed to switches. Second, their arbitration must be
combined, or at least compatible, with the scheduling taking
place in the network interfaces, to offer uniform end-to-end
network services.

We see a future for hybrid NoSs, with first-level
communication over a shared-medium bus, and the higher levels
using a packet-switched network. Perhaps a packet-switched
network can be seen as a distributed and scalable
implementation of a logical bridge that connects all the local busses of
the SoS. Deciding how many IP blocks can use a local bus
³ Minimum-delay wire segments have a length of ...; wire segments
optimized for power-delay product have a length of 48kλ. These lengths
scale with technology like the edge of 50k blocks [4].

Figure 4. IP to IP communication based on a
homogeneous router network.



before connecting to the router network is a question that
must be answered foremost.

6 Conclusion 

We have argued in Section 1 that future systems on silicon
(SoS) will be composed of large numbers of processing
nodes (or IP blocks). Each processing node is relatively
small (50k gates) to scale with technology, and can
be handled by CAD tools, assuming their evolutionary
improvement. The interconnect and communication between
these blocks then becomes an essential function in itself
(Section 2), leading to networks on silicon (NoS). A NoS
is based on packet switching to flexibly share link capacity
between the network clients, and to provide pluriform
communication services over a uniform infrastructure. Both
efficiency and guaranteed performance, such as guaranteed
throughput and latency, are important (Section 3). Efficiently
combining them is a challenge. Section 4 showed that the
performance of a NoS depends on many factors, but is expected
to be high. The cost of a NoS can be stated in terms of area
(routers, network interfaces), utilization of wires, and speed
(latency). They can be traded off against one another, but also,
perhaps more interestingly, against the cost of busses. A hybrid
NoS using shared-wire busses to communicate locally, and
accumulating traffic for a core router network, is a promising
architecture that deserves to be investigated.



ABSTRACT

Managing the complexity of designing chips containing billions of
transistors requires decoupling computation from communication.
For the communication, scalable and compositional interconnects
(such as networks on chip (NoC)) must be used. In this paper we
show that guaranteed services are essential in achieving this
decoupling. Guarantees typically come at the cost of inefficient
resource utilization. To achieve efficiency, they must be used in
combination with best-effort services. We describe a NoC architecture
that efficiently combines guaranteed and best-effort services. The
key element of our NoC is a router consisting conceptually of two
parts: the so-called guaranteed throughput (GT) and best-effort (BE)
routers. Both offer data integrity, lossless and in-order data
delivery. Additionally, the GT router offers guaranteed throughput and
latency services. We combine the GT and BE router architectures
efficiently by sharing router resources, enabling high link utilization.
The guarantees are never affected by the BE traffic, and links are
efficiently utilized because BE traffic uses all bandwidth left over
from GT traffic. Connections are programmed using BE packets.
The programming model is robust, concurrent and distributed. It
enables run-time and compile-time, deterministic and adaptive
connection management. For all our architectural choices, we show the
trade-offs between hardware complexity and efficiency, and
motivate our choices.

1. INTRODUCTION

Recent advances in technology raise the challenge of managing
the complexity of designing chips containing billions of transistors.
A key ingredient in tackling this challenge is decoupling the
computation from communication [9, 15]. This decoupling allows IPs
(the computation part), and the interconnect (the communication
part) to be designed independently from each other.

In this paper, we focus on the communication part. Existing
interconnects (e.g., busses) may no longer be feasible for chips with
many IPs, because of the diverse and dynamic communication
requirements. Networks on a chip (NoC) are emerging as an
alternative to existing on-chip interconnects because they (a) structure
and manage global wires in new deep-submicron technologies [2,
3, 4, 6], (b) share wires, lowering their number and increasing their
utilization [5, 6], (c) can be energy efficient and reliable [2], and
(d) are scalable when compared to traditional busses [7].


Copyright 2001 ACM X-XXXXX-XX-X/XX/XX ...$5.00.



Decoupling the computation from communication requires that
the services that IPs use to communicate (a) are well-defined, and
(b) hide the implementation details of the interconnect [9], see
Figure 1(a). NoCs again help, because they are traditionally
designed using layered protocol stacks [14], where each layer
provides a well-defined interface which decouples service usage from
service implementation [15, 3].

In particular, guaranteed services are essential because they
make the requirements on the NoC explicit, thus limiting the
possible interactions (a stricter contract) of IPs with the communication
environment. As a result, IP design is simpler. IPs can also be
designed independently, because their guaranteed services are not
affected by the interconnect or by other IPs. This is essential for a
compositional construction (design and programming) of systems
on chip. Moreover, for guaranteed services, failures are restricted
to the IP configuration phase (a service request is either granted or
denied by the NoC), which simplifies the IP programming model [6].
We view the guaranteed services to be offered by an interconnect
as a requirement from the applications, see Figure 1(b).

The drawback of using guaranteed services is that they require
resource reservation for worst-case scenarios. As a consequence,
resources may not be efficiently utilized, which may not be
acceptable in a system on a chip where cost constraints are typically
very tight, see Figure 1(c). To overcome this problem, best-effort
services can be used for less critical communication requirements
to fully utilize the available resources. Using best-effort services,
however, provides no guarantees.

Figure 1: Network services (a) hide the interconnect details and
allow reusable components to be built on top of them, (b) are
driven by the application requirements, (c) their efficiency
relies on technology and network organization, and (d) are built
using a layered approach.

A compromise between using guarantees only and having an
efficient interconnect is to combine guaranteed and best-effort
services. Guaranteed traffic should not be affected by best-effort
traffic, while best-effort traffic may use all the resources not used by
guaranteed traffic. Guaranteed services would then be used for the
critical traffic requirements, and best-effort services for non-critical
traffic requirements.
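
The principle that best-effort traffic uses whatever guaranteed traffic leaves unused can be sketched as a per-slot arbiter for one link. The slot-table representation, the queue layout, and the function name are assumptions made for this illustration and are not taken from the router design described later.

```python
from collections import deque


def arbitrate_link(slot_table, gt_queues, be_queue, num_slots):
    """Decide, slot by slot, what is sent on one link.

    slot_table[i] names the guaranteed connection owning slot i, or None.
    A reserved slot is always offered to its owner first, so best-effort
    traffic can never delay guaranteed traffic; unreserved or unused slots
    are handed to the best-effort queue.
    """
    sent = []
    for slot in range(num_slots):
        owner = slot_table[slot % len(slot_table)]
        if owner is not None and gt_queues[owner]:
            sent.append((slot, "GT", gt_queues[owner].popleft()))
        elif be_queue:
            sent.append((slot, "BE", be_queue.popleft()))
    return sent
```

With a table in which connection "a" owns slots 0 and 2 but has only one item queued, the arbiter sends that item in slot 0 and fills slots 1, 2, and 3 with best-effort data, so the reserved but unused slot 2 is not wasted.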

In this paper, we first list a set of network-independent
communication services that are essential in chip design. In the following
sections, we show the trade-offs between efficiency and cost that
we make in our NoC. In Section 3, we present the trade-offs and
take decisions on network-related issues. In Section 4, we zoom
into the internals of the key component of our NoC: a router which
efficiently combines guaranteed and best-effort services.

2. SERVICES 

The increasing complexity of integrated circuits, and the strong
time-to-market pressure require modular designs and IP reuse.
Decoupling computation from communication in chip design serves
both requirements [9]. This decoupling is realized by
defining communication interfaces that provide well-defined
services and hide the implementation details of the interconnect.

We show in Section 1 that guaranteed services are essential to
simplify IP design and integration. Examples of such guaranteed
services are data integrity, which means the data is delivered
uncorrupted; lossless data delivery, which means no data is dropped
in the interconnect; and in-order data delivery, which specifies that
the order in which data is delivered is the same order in which it
has been sent. Other guarantees offer time-related bounds, such as
throughput and latency.

Guarantees require resource reservation for worst-case scenarios,
which can be expensive. For example, guaranteeing throughput for a
stream of data implies reserving bandwidth for its peak
throughput, even when its average is much lower. As a consequence,
when using guarantees, resources are often underutilized.

Resources are better utilized when best-effort traffic is used.
Best-effort services do not reserve any resources, and hence provide
no guarantees. As a consequence, their performance is dictated by
boundary conditions, such as interconnect load. For example, a
connection may become temporarily lossy in a congested network,
if the network resolves congestion by dropping data.

Best-effort services use resources well because they are typically
designed for average-case scenarios as opposed to worst-case
scenarios. They are also easy and fast to use, as they require no
resource reservation. Their main disadvantage is their
unpredictability: one cannot rely on a given performance (i.e., they
do not offer guarantees). In the best case, if certain boundary
conditions are assumed, a statistical performance can be derived.

The requirements for guaranteed services and the efficiency
constraint (good resource utilization) are conflicting. But a first
step to a predictable and low-cost interconnect is combining the
guaranteed and best-effort services in the same interconnect.
Guaranteed services would be used for critical traffic requirements,
and best-effort services for non-critical traffic requirements. For
example, a video processing IP will typically require a lossless,
in-order video stream with guaranteed throughput, but possibly allows
corrupted samples. Another example is cache updates, which require
uncorrupted, lossless, low-latency data transfer, but ordering and
guaranteed throughput are less important. In Section 4.3 we show
how combining guaranteed and best-effort services efficiently uses
common resources. In the remainder of this section we analyze the
minimum level of abstraction at which the communication services
must be offered to hide the network internals.

Traditionally, network services have been implemented and
offered using a layered protocol stack, typically aligned to the
ISO-OSI reference model [14]. NoCs also take this approach [2, 3, 6,
15], because it structures and decomposes the service
implementation, and the protocol stack concepts aid [...].
To achieve the decoupling of computation from communication,
the communication services must be offered at least at the level
of the transport layer in the OSI reference model. It is the first
layer to offer end-to-end services, hiding the network details; see
Figure 1.

The lowest three layers in the protocol stack, namely physical,
data-link and network layers, are network specific. Therefore, these
services should not be visible to the IPs if decoupling of
computation from communication is desired. However, these layers are
essential in implementing the services, because constructing
guarantees without guarantees at the layer below is either very
expensive, or even impossible. For example, implementing a lossless
communication on top of a lossy service requires acknowledgment,
data retransmission, and filtering duplicated data. This leads to a
significant increase in traffic, and also a trade-off between large
buffer space requirements and long delays. Even worse, providing
guarantees for time-related services is impossible if lower layers
do not offer these guarantees. For example, throughput cannot be
guaranteed if communication at a lower layer is lossy. As a
consequence, guarantees can only be built on top of guarantees, see
Figure 1(b). Similarly, a layer's efficiency is based on efficient
implementations of the layers below it, see Figure 1(c).

The NoC services that we consider essential for chip design are:
data integrity, lossless data delivery, in-order delivery, throughput,
and latency. Data integrity is always guaranteed. All the other
services can be guaranteed or not, depending on request. In the
next section, we describe briefly how these services are provided
by our NoC, and in Section 4 we describe in detail how our router
architecture enables an efficient implementation of these services.

3. NETWORKS ON CHIP 

Currently, the prevalent on-chip interconnects are busses and
switches [10]. These are single-hop interconnects, meaning that
there is no storage in the interconnect itself. Scalable interconnects
require multiple hops with storage in every hop (router). This
introduces a number of new issues, which we discuss in this section.

General computer network research is a mature research
field [16] which has many issues in common with NoCs. However,
two significant differences between computer networks and
on-chip networks make the trade-offs in their design very
different. First, routers of a NoC are more resource constrained than
those in computer networks, in particular in the control complexity
and in the amount of memory. Second, communication links of a
NoC are relatively shorter than those in computer networks,
allowing tight synchronization (network flow control) between routers.

These two characteristics have a direct impact on the NoC
service implementation. In a NoC, it is possible to solve the data
integrity at the data-link layer at a low cost. We, therefore, assume
it solved at the network layer and higher. Lossless transport of data
is guaranteed by our routers. However, to allow consumers slower
than producers, the network may be allowed to drop data at its edge.
Consequently, the designer may choose either for (a) a lossless
connection (i.e., implementing end-to-end flow control), or (b) a
lossy connection (i.e., without flow control). In-order delivery is
again guaranteed by our router (i.e., routers do not reorder data
between a given input port and a given output port). End-to-end
ordering of data, however, has to be provided on top of this at the
network edge when data is transported on different routes with
different delays. Offering guaranteed and best-effort throughput and
latency services is also implemented by the routers. These router
services together with the programming model explained in
Section 4.3.2 [...]

We identify four important issues in the design of the router
network architecture. These are: the switching mode, routing,
contention resolution, and network flow control. Equally important,
end-to-end flow control and congestion control are handled in our
NoC at the network edge instead of the routers; we therefore omit
their discussion here.

3.1 Switching Mode

The switching mode of a network specifies how data and control
are related. We distinguish circuit switching and packet switching.

In circuit switching data and control are separated. First the
control is provided to the network (connection set up). This results
in a circuit over which all subsequent data of the connection is
transported. In time-division switching bandwidth is shared by
time-division multiplexing connections over circuits.
Circuit-switched networks inherently offer time-related guaranteed
services when resources are reserved during the connection set up.

In packet switching data is divided into packets and every packet
is composed of a control part (the header), and a data part (the
payload). Network routers inspect, and possibly modify, the headers
of incoming packets to switch the packet to the appropriate
output port. Since in packet switching the packets are self contained,
there is no need for a set-up phase to allocate resources. Best-effort
services are therefore naturally provided by packet switching.

3.2 Routing

Routing is the determination of the route (or path) that the data
follows from source to destination. There are two basic approaches:
source routing and destination routing. In source routing, the
network interface at the source computes the complete route to the
destination. In destination routing, only the network address of
the destination is specified, and every router selects the appropriate
output based on the address. We refer to [17] for several classes of
routing functions.

In circuit switching, routing takes place at connection set up, i.e.,
once for all data in that connection. In packet switching, routing is
done for every individual packet sent over the network. In both
cases, source and destination routing are possible. We currently
consider source routing because it is independent of the router
network topology, which is not yet determined.
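
The difference between the two routing approaches can be pictured
with a minimal Python sketch. The function names, the header layout
(a plain list of output ports) and the per-router lookup tables are
hypothetical illustrations, not the packet format of our NoC.

```python
from typing import Dict, List, Tuple


def source_route(path: List[int]) -> List[Tuple[int, List[int]]]:
    """Source routing: the header carries the complete list of output ports.

    Every router consumes the first entry and forwards the remainder.
    Returns one (chosen output port, remaining header) pair per hop.
    """
    hops = []
    header = list(path)
    while header:
        port, header = header[0], header[1:]
        hops.append((port, header))
    return hops


def destination_route(tables: List[Dict[str, int]], dest: str) -> List[int]:
    """Destination routing: every router on the path looks up the
    destination address in its own (topology-dependent) table."""
    return [table[dest] for table in tables]
```

In the source-routed case the routers need no knowledge of the
topology; only the network interface that builds the header does.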

3.3 Contention Resolution 

When a router attempts to send multiple data items over the same
link at the same time, contention is said to occur. As only one data
item can occupy a link at any point in time, a selection among the
contending data must be made; this process is called contention
resolution. Three approaches exist: avoiding contention, dropping
data (one of the contending data items is transmitted and the
remainder are deleted), and scheduling (or sequentializing) data (all
data items are sent in turn; some data items are therefore delayed).

In circuit switching contention resolution takes place at set up
at the granularity of connections, so that data sent over different
connections do not conflict. Thus, there is no contention during
data transport, and time-related guarantees can be given.

In packet switching contention resolution takes place at the
granularity of individual packets. Dropping packets is possible, but
for a lossless service (a) it adds complexity to the network
(acknowledgments, retransmission, etc.), and (b) it ultimately
increases the traffic because dropped packets need to be resent. Thus,
scheduling data is the only remaining option.

3.4 Network Flow Control

Network flow control, also called routing mode, deals with the
limited amount of buffering in routers and data acceptance between
routers. In circuit switching connections are set up. The data sent
over these connections is always accepted by the routers and hence
no network flow control is needed. In packet switching, data must
be buffered at every router before it is sent on. Because routers
have a limited amount of buffering they accept data only when they
have enough space to store the incoming data.

There are three types of flow control, namely store-and-forward,
virtual cut-through, and wormhole routing. In store-and-forward
routing, an input packet is received and stored in its entirety before
it is forwarded to the next router. This requires storage for the
complete packet, and implies a per-router latency of at least the time
required for the router to receive the packet.

In virtual cut-through routing a packet is forwarded as soon as
the next router guarantees that the complete packet will be
accepted. Only when no guarantee is given, the whole packet is stored
in the router. Thus, virtual cut-through routing requires buffer space
for a complete packet, like store-and-forward routing, but allows
lower-latency communication.

In wormhole routing packets are split in so-called flits (flow
control digits). A flit is passed to the next router when that router
accepts that flit, even when there is not enough buffer space for the
complete packet. As soon as a flit of a packet is sent over an output
port, that output port is reserved for flits of that packet only. When
the first flit of a packet is blocked the trailing flits can therefore
be spread over multiple routers, blocking the intermediate links.
Wormhole routing requires the least buffering (buffer flits instead
of packets) and also allows low-latency communication. However,
it is more sensitive to deadlock and generally results in lower link
utilization than virtual cut-through routing.

We opt for wormhole routing because it offers low latency, which
is one of our targeted services, and because it has the lowest cost in
terms of buffering, which is expensive on-chip.
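
The latency and buffering argument can be made concrete with a small
Python sketch. It rests on an idealized model that is an assumption of
this example, not a measurement: no contention, one flit crosses one
link per cycle, and router-internal delays are ignored. All function
names are hypothetical.

```python
def store_and_forward_latency(hops: int, packet_flits: int) -> int:
    """Cycles until the last flit arrives when every router waits for
    the complete packet before forwarding it over the next link."""
    return hops * packet_flits


def wormhole_latency(hops: int, packet_flits: int) -> int:
    """Cycles until the last flit arrives when flits are pipelined:
    the head flit needs `hops` cycles, the tail follows
    `packet_flits - 1` cycles behind it."""
    return hops + packet_flits - 1


def min_buffer_flits(mode: str, packet_flits: int) -> int:
    """Minimum buffer space per router input, in flits."""
    if mode in ("store_and_forward", "virtual_cut_through"):
        return packet_flits      # must be able to hold a whole packet
    if mode == "wormhole":
        return 1                 # a single flit suffices
    raise ValueError(f"unknown flow control mode: {mode}")
```

Under these assumptions an uncontended virtual cut-through network
has the same latency as wormhole routing; the two differ in the
buffer space they need when a packet is blocked.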

4. A COMBINED GT-BE ROUTER 

Section 2 defines our requirements for NoCs in terms of services
that are to be offered, in particular, both guaranteed and best-effort
services. The previous section introduces a number of general
networking issues that will be built upon here. In the following two
subsections we show that the guaranteed and best-effort services
can conceptually be described by two independent router
architectures. The combination of these two router architectures is
efficient and has a flexible programming model, as described in
Subsection 4.3.

4.1 A GT Router Architecture 

Our guaranteed-throughput (GT) router must guarantee
uncorrupted, lossless and ordered data transfer, and both throughput
and latency over a finite time interval. As mentioned earlier, data
integrity is solved at the data-link layer; we do not address it
further. No data is dropped by the GT router because we use a variant
of circuit switching (described in the next section). Data is
transported in fixed-size blocks, further explained below. As only one
block is stored per input in the GT router, blocks remain ordered. We
now turn to the more challenging time-related guarantees, namely
throughput and latency.

4.1.1 Time-related Guarantees

Latency is defined as the time a packet spends in the network.
Guaranteeing latency, therefore, means that a worst-case upper
bound must be given for this time. Here we define throughput
for a given producer-consumer pair as the amount of data
transported by the network over a finite, fixed time interval.
Guaranteeing throughput means giving a lower bound.

We observe that guaranteeing latency in a lossless router is
difficult because contention requires scheduling and hence delays.
Guaranteeing throughput is less problematic. Rate-based packet
switching (for an overview see [18]) offers guaranteed throughput
over a finite period, and hence a latency bound. This bound is very
high, however, and the cost of buffering is also high.
Deadline-based packet switching [13] offers preferential treatment for
packets close to their deadline. This allows differential latency
guarantees (under certain admissible traffic assumptions), but also at
high buffer costs.

Circuit switching solves the contention at set up, so naturally
providing guaranteed latency and throughput. Circuits can be
pipelined to improve throughput [5], at the cost of additional
buffering and latency. Time-division multiplexing connections over
pipelined circuits additionally offers flexibility in bandwidth
allocation. This requires a notion of router synchronicity, which is
possible because a NoC is better controllable than a general network.
We explain this variation in more detail in the next subsection. The
associated programming model is described in Section 4.3.2.

4.1.2 Contention-free Routing

A router uses a slot table to (a) avoid contention on a link, (b)
divide up bandwidth per link, and (c) switch data to the correct
output. Every slot table R has S fixed-size time slots (rows), and N
router outputs (columns). There is a logical notion of synchronicity:
all routers in the network are in the same slot. In a slot s at
most one block of data can be read/written per input/output port. In
the next slot (s+1)%S, the read blocks are written to their appropriate
output ports. Blocks thus propagate in a store-and-forward fashion.
The latency a block incurs per router is equal to the duration of a
slot. Bandwidth is guaranteed in multiples of block size per S slots.

The entries of the slot table map outputs to inputs for every slot:
R(s,o) = i. An entry is empty when there is no reservation for
that output in that slot. No contention arises because there is at most
one input per output. Sending a single input to multiple outputs
(multicast) is possible.
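
A slot table with entries R(s,o) = i can be sketched in a few lines of
Python. The class and method names are hypothetical, and the sketch
deliberately ignores the one-slot pipeline delay between reading and
writing a block; it only illustrates why one input per (slot, output)
entry rules out contention while still permitting multicast.

```python
from typing import Dict, List, Optional


class SlotTable:
    """S rows (time slots) by N columns (outputs); entry = input or None."""

    def __init__(self, num_slots: int, num_outputs: int) -> None:
        self.num_slots = num_slots
        self.table: List[List[Optional[int]]] = [
            [None] * num_outputs for _ in range(num_slots)
        ]

    def reserve(self, slot: int, output: int, inp: int) -> bool:
        """Reserve `output` in `slot` for `inp`; fail if already taken."""
        if self.table[slot][output] is not None:
            return False
        self.table[slot][output] = inp
        return True

    def switch(self, slot: int, blocks: Dict[int, str]) -> Dict[int, str]:
        """Map the blocks available per input to the outputs reserved in
        `slot`. Each output receives at most one block by construction."""
        return {
            out: blocks[inp]
            for out, inp in enumerate(self.table[slot])
            if inp is not None and inp in blocks
        }
```

Reserving the same input for two outputs in one slot yields multicast;
a second reservation of an occupied (slot, output) entry is refused.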

The slots reserved for a block along its path from source to
destination increase by one (modulo S). If slot s is reserved in a
router, slot (s + 1)%S must be reserved in the next router on the path.
The assignment of slots to connections in the network is an
optimization problem, and is described in Section 4.3.3. Section 4.3.2
explains how slots are reserved in the network, by means of
best-effort packets.
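
As a small worked example of the modulo-S rule, the following Python
sketch lists the slot that must be reserved in each router of a path.
The function names are hypothetical; the latency helper simply restates
that a block spends one slot per router.

```python
from typing import List


def slots_along_path(first_slot: int, hops: int, num_slots: int) -> List[int]:
    """Slot to reserve in each successive router, starting with
    `first_slot` in the first router and adding one (modulo S) per hop."""
    return [(first_slot + k) % num_slots for k in range(hops)]


def path_latency_in_slots(hops: int) -> int:
    """A block incurs one slot duration of latency per router."""
    return hops
```

For S = 8 and a connection that starts in slot 6 and crosses four
routers, the reserved slots wrap around the end of the table.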

4.2 A BE Router Architecture 

Best-effort (BE) traffic can have a better average performance
than offered by guaranteed services. This depends on boundary
conditions, such as network load, that are unpredictable.
Best-effort services thus fulfill our efficiency requirement, but
without offering time-related guarantees. This section describes an
architecture for a best-effort service with uncorrupted, lossless,
in-order data transport.

The router efficiency is influenced by both its complexity and
its utilization. In Section 3 we have justified our choice for
routing (source routing) and network flow control (wormhole). Now
we determine the contention resolution scheme that is used. It has
two components: buffering and scheduling. Our router prototypes
show that the buffering costs dominate the cost of the router. The
main trade-off in Section 4.2.1 is therefore between buffer costs and
link utilization, which are both critical resources. For the chosen
buffering strategy an efficient scheduling algorithm is presented in
Section 4.2.2, trading off link utilization and scheduling complexity.

4.2.1 Buffering Strategy

The buffering strategy determines the location of buffers inside
the router. We distinguish input queuing, output queuing, and
virtual output queuing. In the following, N is the number of inputs
(equal to the number of outputs) of the router. We believe that in
a balanced solution the rates at which routers and links operate are
equal. Slower routers require more buffering, and faster routers are
not feasible as links operate at high speed.

In input queuing there is a single queue per input, resulting in
the lowest buffer cost (N logical queues in N physical memories)
of all three approaches. However, due to the so-called head-of-line
blocking, for large N network utilization saturates at 59% [8].
Therefore, input queuing results in weak utilization of the links.

Output queuing can increase the link utilization to 100% by
having N queues at each output, or N^2 queues, with as many physical
memories. It is better to have fewer, larger memories than more,
smaller memories because the overhead of small RAMs is very
high. Overclocking the router by a factor N to use N memories
is not possible, as argued previously. So the number of memories
depends quadratically on N, hence output queuing is not scalable.

Virtual output queuing [1] (VOQ) combines the advantages of
input queuing and output queuing. It has the buffering complexity
of input queuing and the link utilization of output queuing. As for
output queuing, there are N^2 logical queues, but they are combined
in N physical memories at the inputs as for input queuing. For
every input i there are N queues Q(i,o), one for each output o, see
Figure 2. There is at most one write to these queues. The difference
between output queuing and VOQ is the additional constraint that
there can be at most one read from this group of N queues. (This
enables the mapping of all input queues of the same input to one
memory.) This additional constraint has to be taken into account by
the scheduling. 100% link utilization can still be achieved when N is
large [12].

We select VOQ because it combines high link utilization with
moderate buffer costs.
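
The following Python sketch shows, under simplifying assumptions, how
one input port with N virtual output queues avoids head-of-line
blocking. The class name and its methods are hypothetical, and the
"one physical memory" is modeled merely as one object holding N
logical queues.

```python
from collections import deque
from typing import Deque, List


class VOQInput:
    """One router input with a logical queue Q(i, o) per output o."""

    def __init__(self, num_outputs: int) -> None:
        self.queues: List[Deque[str]] = [deque() for _ in range(num_outputs)]

    def enqueue(self, output: int, flit: str) -> None:
        """The single write per cycle: store a flit in the queue of its output."""
        self.queues[output].append(flit)

    def requests(self) -> List[int]:
        """Outputs for which this input has at least one flit waiting."""
        return [o for o, q in enumerate(self.queues) if q]

    def dequeue(self, output: int) -> str:
        """The single read per cycle, for the output chosen by the scheduler."""
        return self.queues[output].popleft()
```

With a single FIFO per input, a flit waiting for a busy output would
block every flit behind it; here a flit for another output can be
served first.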

4.2.2 Matrix Scheduling

This section shows how link contention and memory contention
(imposed by VOQ) are resolved. Matrix scheduling solves both
kinds of contention by ensuring that every VOQ memory is read at
most once, and every output (link) is written to at most once. The
scheduling problem can be modeled as a bipartite graph matching
problem as follows. Every input port i is modeled by a node u_i, and
every output port o by a node v_o. There is an edge between u_i and
v_o if and only if queue Q(i,o) is non-empty. A match is a subset
of these edges such that every node is incident to at most one edge.
For example, Figure 3(c) is a match of Figure 3(a). The number of
edges in the match is its size; a match is maximal when no edges
can be added to it. A maximum size match is a largest size match.

Although optimal, there are two reasons not to consider only
maximum size matches. First, maximum size matching algorithms
have O(N^(5/2)) complexity. Since matrix scheduling is done at flit
rate this is not feasible for large N. Second, maximum size
matching algorithms can be unfair, which can result in starvation,
i.e., some queues are never served.

Figure 2: Schematic of a router using virtual output queuing.

Figure 3: The three steps of a single iSLIP iteration.

There are several matching algorithms; see [11] for a thorough
discussion. We select the iterative SLIP (iSLIP) matrix scheduling
algorithm [11], because it has a low complexity, avoids starvation,
and provides increasing performance as the number of iterations
grows. It reaches a maximal match in log2(N) iterations. Even a
single iteration considerably outperforms input queuing, and can be
efficiently implemented in hardware. Multiple iterations increase
the latency of the control path, and hence the flit size (as explained
in Section 4.3.1). We consider using 1-SLIP because multiple
iterations give only marginal improvement.

A single iSLIP iteration has three steps, illustrated by an example
in Figure 3 for N = 4. In the first stage, see Figure 3(a), every
non-empty queue Q(i,o) requests access to output port o from input
port i. In the second stage, see Figure 3(b), every output port o
grants one request, solving link contention at the output ports. In
the third stage, see Figure 3(c), every input port i accepts one grant,
to resolve memory contention at the input port. We extend iSLIP
to take network flow control into account.
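
The request, grant and accept stages can be sketched in Python as
follows. This is an illustrative model only: the round-robin grant and
accept pointers follow the usual textbook description of iSLIP rather
than our hardware, the extension for network flow control is not
modeled, and the function name and the request pattern used in the
usage note are hypothetical (they are not the example of Figure 3).

```python
from typing import Dict, List, Set


def islip_iteration(requests: List[Set[int]],
                    grant_ptr: List[int],
                    accept_ptr: List[int]) -> Dict[int, int]:
    """One request/grant/accept round.

    requests[i] is the set of outputs o with a non-empty queue Q(i, o).
    grant_ptr[o] and accept_ptr[i] are round-robin pointers, updated in
    place only for grants that are accepted. Returns {input: output}.
    """
    n = len(requests)

    # Grant: every output picks the first requesting input at or after
    # its pointer, resolving link contention.
    granted: Dict[int, List[int]] = {}
    for o in range(n):
        for k in range(n):
            i = (grant_ptr[o] + k) % n
            if o in requests[i]:
                granted.setdefault(i, []).append(o)
                break

    # Accept: every input picks the first granting output at or after
    # its pointer, resolving memory contention.
    match: Dict[int, int] = {}
    for i, outs in granted.items():
        for k in range(n):
            o = (accept_ptr[i] + k) % n
            if o in outs:
                match[i] = o
                grant_ptr[o] = (i + 1) % n
                accept_ptr[i] = (o + 1) % n
                break
    return match
```

Calling the function twice with the same requests, for instance
`[{0, 1}, {0}, set(), {2, 3}]` and all pointers initially zero, gives
different matches in the two rounds because the pointers move past the
ports that were just served; this rotation is what prevents starvation.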

4.3 Combining the GT and BE Routers

The GT and BE router architectures are combined to share
resources, in particular the links, memories, and switches. Moreover,
best-effort traffic enables a packet-based programming model for
the guaranteed traffic, as shown later, in Section 4.3.2.

The principal constraint for a combined router architecture is that
guaranteed services are never affected by best-effort services.
Figure 4(a) shows that, conceptually, the combined router contains
both router architectures (fat lines represent data transport, thin
lines represent control transport). Incoming data is switched to
either the GT or the BE router. The GT traffic (the traffic that is
served by the GT router) has the higher priority, to maintain
guarantees. This is ensured by the arbitration unit, which therefore
affects the best-effort scheduling. Furthermore, best-effort packets
can program the guaranteed router, as shown by the arrow labeled
program. Thin lines going from the right to the left indicate network
flow control, which is only required for best-effort packets because
guaranteed blocks never encounter contention.

On a shared link only one BE or GT data item can arrive or be
sent at any point in time, thus GT and BE memories can be shared,
keeping the number of memories at N, with N + N^2 logical queues
in total. Figure 4(b) shows that the data path consisting of
memories and switch matrix is shared and that the control paths of the
BE and GT routers are separate, yet interrelated. Moreover, the
arbitration unit of Figure 4(a) has been absorbed by the BE router. The
following subsection shows how this can be done.

Figure 4: Two views of the combined GT-BE router.

4.3.1 Arbitration and Flit Size

When combining GT and BE traffic in a single network the
impact on the network flow control scheme must be taken into
account. Recall from Section 3.4 that a BE flit is the smallest unit on
which flow control is performed. In other words, the BE scheduling,
using iSLIP, can only react to GT blocks at flit granularity. To avoid
alignment problems, the block size (B words) is a multiple of the
flit size (F words, with B = lF). l is constant; we prefer a small l and
F to decrease the store-and-forward delay for guaranteed traffic.

We extend iSLIP to handle the combination of GT and BE traffic.
In this combination GT traffic always has priority over BE traffic.
This is to ensure that guarantees are never corrupted.
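
One way to picture the priority rule is as a masking step that runs
before the BE scheduler. The Python sketch below is an assumption made
for illustration: it supposes that, in a flit cycle where a GT block
occupies an input memory or an output link, BE requests touching that
input or output are simply withheld from the scheduler. The function
names are hypothetical and the actual arbitration logic may differ.

```python
from typing import List, Set


def mask_be_requests(requests: List[Set[int]],
                     gt_inputs: Set[int],
                     gt_outputs: Set[int]) -> List[Set[int]]:
    """Remove BE requests that would collide with GT traffic.

    requests[i] holds the outputs requested by BE queues of input i.
    gt_inputs / gt_outputs are the ports used by GT blocks in this cycle.
    """
    masked: List[Set[int]] = []
    for i, outs in enumerate(requests):
        if i in gt_inputs:
            masked.append(set())  # shared memory is busy with a GT read
        else:
            masked.append({o for o in outs if o not in gt_outputs})
    return masked


def flits_per_block(block_words: int, flit_words: int) -> int:
    """The constant l in B = l * F; B must be a multiple of F."""
    if block_words % flit_words != 0:
        raise ValueError("block size must be a multiple of the flit size")
    return block_words // flit_words
```

Because a block spans l flit cycles, the mask stays in place for l
consecutive scheduling decisions, which is why a small l and F are
attractive.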

4.3.2 Programming Model 

In this section we show how GT connections are set up and torn
down by means of BE packets. To ensure scalability, programming
must not require global or centralized resources. Section 4.1.2
explains why our contention-free routing uses slot tables; we now see
that they are distributed over routers for scalability.

Initially the slot table of every router is empty. This means that
GT connections can only be set up using BE packets, unless an
additional communication infrastructure is introduced solely for
programming. Two special packets, Reset and Start, are used to reset
and start the NoC, respectively. They progress by flooding, and are
not subject to the usual network flow control. We will not discuss
them further. There are three system packets: SetUp, TearDown,
and AckSetUp. They are used to program the slot table in every
router on their path.

The SetUp packet is used to create a connection from a source
to a destination, and travels in the direction of the data
("downstream"). AckSetUp acknowledges a successful set up, and flows
upstream. The TearDown packet destroys (partially) existing
connections, and can travel in either direction. SetUp packets contain
the source of the data, the path to their destination, and a slot
number. Every router along the path of the SetUp packet checks if the
output to the next router in the path is free in the slot indicated by
the packet. If it is free, the output is reserved in that slot, and the
SetUp packet is forwarded with an incremented (modulo S) slot.
Otherwise, the SetUp packet is discarded and a TearDown packet
returns along the same path. Thus every path must be reversible;
this is the only assumption we make about the network topology.
These upstream TearDown packets free the slot, and continue with
a decremented slot. Downstream TearDown packets work similarly,
and remove existing connections. A connection is successfully
created when an AckSetUp is received, else a TearDown is received.
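
The set-up procedure can be traced with a sequential Python sketch.
It is a simplification under stated assumptions: packets are processed
one at a time (no concurrency), each router's slot table is a plain
dictionary keyed by (slot, output), and the upstream TearDown is
emulated by undoing the reservations in reverse order. The function
name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple

SlotTable = Dict[Tuple[int, int], str]  # (slot, output) -> connection id


def set_up(tables: List[SlotTable], path: List[int], slot: int,
           num_slots: int, conn: str) -> bool:
    """Walk a SetUp packet along `path` (one output port per router).

    Returns True if every router could reserve its output (the source
    would receive AckSetUp) and False otherwise (the source would
    receive TearDown, and all partial reservations are freed).
    """
    reserved: List[Tuple[int, Tuple[int, int]]] = []
    for router, output in enumerate(path):
        key = (slot, output)
        if key in tables[router]:
            # Conflict: discard SetUp, TearDown travels back upstream
            # and frees the slots reserved so far.
            for r, k in reversed(reserved):
                del tables[r][k]
            return False
        tables[router][key] = conn
        reserved.append((router, key))
        slot = (slot + 1) % num_slots  # incremented slot for the next router
    return True
```

A second connection that needs an output already reserved in the same
slot of some router fails there, and the routers it passed earlier are
left exactly as they were before the attempt.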

The programming model is pipelined and concurrent (multiple
system packets can be active in the network simultaneously, also
from the same source) and distributed (active in multiple routers).
Given the distributed nature of the programming model, ensuring
consistency and determinism is crucial. The outcome of
programming may depend on the execution order of system packets, but
is always consistent. The next section shows how to use the
programming model.

4.3.3 Slot Allocation

This section explains ways to determine the slots specified in
SetUp packets. A slot allocation for a single connection requires
that, at every router along the path, the required output is free in the
appropriate slot. Therefore, interference of SetUp packets of
multiple connections can be completely avoided if connections are set
up with conflict-free slots or paths. All execution orders of SetUp
packets then give the same result.

Computing an optimal slot allocation is complex and requires a
global network view. It can be used only for small problem
instances. To reduce computational cost, heuristics can be used, but
this probably leads to non-optimal solutions. Compile-time slot
allocations from both approaches can be recreated deterministically
at run time, concurrently and distributed (because all SetUp
packets are conflict-free).

At run time, a global view requires a centralized slot allocation.
This impairs scalability and slows down programming. Run-time
distributed slot allocation is scalable, but lacks a global view. This
typically results in suboptimal slot allocation. Moreover, SetUp
packets may interfere, making programming more involved, and
perhaps non-deterministic. However, dynamic connection
management at high rates will require distributed slot allocation. In
a simple distributed greedy algorithm, all sources repeatedly
generate random slot numbers for each set up until their connection
succeeds.
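
The greedy scheme can be sketched in Python as a retry loop. The
sketch makes several assumptions for brevity: all reservations live in
one shared set of (router, slot, output) triples, a set-up attempt is
checked atomically instead of by exchanging SetUp and TearDown
packets, and a retry limit (not part of the algorithm as described) is
added so that the loop terminates when no slot is free. All names are
hypothetical.

```python
import random
from typing import List, Optional, Set, Tuple

Reservation = Tuple[int, int, int]  # (router index, slot, output port)


def try_set_up(reserved: Set[Reservation], path: List[int],
               slot: int, num_slots: int) -> bool:
    """Reserve slot, slot+1, ... (modulo S) along the path, or nothing."""
    needed = [(r, (slot + r) % num_slots, out) for r, out in enumerate(path)]
    if any(entry in reserved for entry in needed):
        return False
    reserved.update(needed)
    return True


def greedy_allocate(reserved: Set[Reservation], path: List[int],
                    num_slots: int, rng: random.Random,
                    max_tries: int = 100) -> Optional[int]:
    """Draw random start slots until a set-up succeeds.

    Returns the start slot, or None if `max_tries` attempts all failed.
    """
    for _ in range(max_tries):
        slot = rng.randrange(num_slots)
        if try_set_up(reserved, path, slot, num_slots):
            return slot
    return None
```

The result depends on the random draws and on the order in which
sources are served, which mirrors the loss of determinism noted above.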

We conclude that our programming model allows both
compile-time and run-time slot allocation. Computational complexity,
deterministic results, and scalability can be balanced according to
system requirements.

5. CONCLUSIONS 

Managing the complexity of designing chips containing [...]
of transistors requires decoupling computation from
communication. For communication, networks on chip (NoC) are [...]
as an alternative for existing interconnects to solve technological,
[...]


In this paper we show that guaranteed services are essential to
provide predictable interconnects that enable compositional system
design and integration. However, guarantees typically utilize
resources inefficiently. Best-effort services overcome this problem
but provide no guarantees. So, combining guaranteed and
best-effort services allows efficient resource utilization, yet still
providing guarantees for critical traffic.

Time-related guarantees, such as throughput and latency, can
only be constructed on a NoC that intrinsically has these properties.
We therefore define a router-based NoC architecture that combines
guaranteed and best-effort services. Thus, the router architecture
has conceptually two parts: the guaranteed-throughput (GT) and
best-effort (BE) routers. Both offer data integrity, lossless data
delivery, and in-order data delivery. Additionally, the GT router offers
guaranteed throughput and latency services using pipelined circuit
switching with time-division multiplexing. This requires a notion
of synchronicity: at each time slot at most one block of data is
communicated over a link. The GT router has low latency and
moderate memory requirements. The BE router uses packet switching,
wormhole routing, and virtual output queuing with iSLIP. The BE
router has low latency, high link utilization, and moderate memory
[...]


[...] using BE packets. The programming model is [...]
and distributed. It enables run-time and [...]
and adaptive connection management.

For all our architecture choices, we show the trade-offs between
hardware complexity and efficiency, and motivate our choices.

In conclusion, we describe and motivate a combined gnaranteed 
and best-effort router, which Is an essential component in a NoC 
Is fulfills our requirements by providing guaranteed services, and 
satisfies the efficiency constraint by good resource utilization. 




We combine the GT and BE router architectures efficiently by
sharing router resources. The guarantees are never affected by the
BE traffic, and links are efficiently utilized because BE traffic uses
all bandwidth left over by GT traffic. Connections are programmed
1. Integrated circuit comprising a plurality of modules, and a network arranged
for transferring messages between the modules, wherein a message issued by a 
module comprises first information indicative for a location of an addressed module 
within the network, and second information indicative for a location within the 
addressed module, 

characterized in that the first and the second information are arranged as a single
address from which the network determines which module is addressed, and from
which the addressed module determines which of its locations is selected.

2. Method for exchanging messages in an integrated circuit comprising a 
plurality of modules, the messages between the modules being exchanged via a 
network, wherein a message issued by a module comprises first information indicative 
for a location of an addressed module within the network, and second information
indicative for a location within the addressed module, 

characterized in that the first and the second information are arranged as a single 
address from which the network determines which module is addressed, and from
which the addressed module determines which of its locations is selected. 

3. Integrated circuit comprising a plurality of processing modules and a network 
arranged for providing at least one communication channel between a first and a second
module, which communication channel supports transactions comprising outgoing
messages from the first module to the second module and return messages from the 
second module to the first module, characterized in that the network manages the 
outgoing messages in a way different from the return messages. 

4. Method for exchanging messages in an integrated circuit comprising a 
plurality of modules, the messages between the modules being exchanged via a 
network, wherein a communication channel through the network supports transactions 
comprising outgoing messages from the first module to the second module and return 
messages from the second module to the first module, characterized in that the 
network manages the outgoing messages in a way different from the return messages. 

5. Integrated circuit according to claim 3, wherein the network has a first mode 
wherein a message is transferred within a guaranteed time interval, and a second 
mode wherein a message is transferred as fast as possible with the available resources, 
wherein the outgoing transaction is a read message, requesting the second module to 
send data to the first module, wherein the return transaction is the data generated by 
the second module upon this request, and wherein the outgoing transaction is 
transferred according to the second mode, and the return transaction is transferred 
according to the first mode. 
6. Integrated circuit according to claim 3, wherein the network allows at least
two of the following transaction modes: unordered, locally ordered and globally
ordered, wherein an unordered transaction mode of the network gives no guarantees
for the order in which messages will arrive at their destination, a locally ordered
transaction mode guarantees that messages sent to the same destination will arrive in
the same order as they were sent, a globally ordered transaction mode guarantees that
messages will arrive in the same order as they were sent even if they are sent to
different destinations, wherein outgoing and return transactions are handled according
to different transaction modes.



7. Integrated circuit according to claim 3, wherein the network reserves a first 
and a second buffer space for the first and the second module respectively, the 
buffer spaces having a mutually different size.

8. Integrated circuit comprising a plurality of modules, which modules are 
arranged to communicate to each other via a network, wherein the network is 
arranged to distribute a message from a first module to two or more second modules, 
and wherein the second modules are arranged to generate an acknowledge message 
indicating receipt of the message from the first module, 

the network being arranged to generate a single return message to the first module, in 
dependence of the acknowledge messages of the second modules. 

9. Integrated circuit according to claim 8, wherein the single return message 
indicates that at least one of the second modules has received the message issued by 
the first module. 

10. Integrated circuit according to claim 8, wherein the single return message 
indicates that each of the second modules has received the message issued by the first 
module. 

11. Integrated circuit comprising a first plurality of processing modules and a
network, the network comprising a second plurality of nodes and interconnections
between nodes, the network being arranged for transferring messages between a first
and a second module via a path through the network, the processing modules coupled
to the network via a network interface having a buffer for receiving incoming 
messages, wherein a message from a first to a second module is not initiated until the 
buffer has sufficient space for receiving a return message from the second module. 
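
The single-address arrangement of claims 1 and 2 can be illustrated with a short sketch. The claims do not prescribe an encoding, so the bit-field split used here is only one assumed realization, and the field widths MODULE_BITS and OFFSET_BITS as well as the function names are hypothetical. The network inspects only the upper field to find the addressed module, and the addressed module inspects only the lower field to select one of its locations.

```python
# Assumed field widths; the claims do not fix any particular encoding.
MODULE_BITS = 8    # upper bits: which module is addressed (used by the network)
OFFSET_BITS = 24   # lower bits: location inside that module (used by the module)


def make_address(module_id: int, offset: int) -> int:
    """Pack the first and second information into a single address."""
    if not 0 <= module_id < (1 << MODULE_BITS):
        raise ValueError("module_id does not fit in the module field")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in the offset field")
    return (module_id << OFFSET_BITS) | offset


def network_route(address: int) -> int:
    """What the network derives from the address: the addressed module."""
    return address >> OFFSET_BITS


def module_select(address: int) -> int:
    """What the addressed module derives: the selected internal location."""
    return address & ((1 << OFFSET_BITS) - 1)
```

With this split, the issuing module presents one flat address, as it would on a bus, and needs no separate routing information.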
Abstract

Networks are emerging as a possible solution for on-chip
interconnects. In this paper, we describe how networks
on chip (NoC) are similar to and differ from
both off-chip networks (e.g., computer networks) and current
on-chip interconnects (e.g., buses). We re-examine
the communication services in the context of NoCs. We
provide services that abstract from network implementations,
enabling a clean separation between the NoC and
IP blocks. We define a request-response transaction model
similar to bus protocols, making our approach backward
compatible. To exploit the full power of NoCs,
we also provide connection-oriented communication with
differentiated services. Examples are bandwidth guarantees,
transaction orderings, and end-to-end flow control.

Key Words: Networks on chip, on-chip buses, computer 
networks, communication services, protocol stack, 
transaction, connection. 